High-throughput genetic sequencing arrays, with thousands of
measurements per sample, and the large amount of related censored
clinical data have created a pressing need for better
measurement-specific model selection. In this paper we establish strong
oracle properties of nonconcave penalized methods for nonpolynomial (NP)
dimensional data with censoring in the framework of Cox's proportional
hazards model. A class of folded-concave penalties is employed, and both
LASSO and SCAD are discussed specifically. We address the question of
under which dimensionality and correlation restrictions an oracle
estimator can be constructed and attained. It is demonstrated that
nonconcave penalties lead to a significant relaxation of the
"irrepresentable condition" needed for LASSO model selection
consistency. A large deviation result for martingales, of interest in
its own right, is developed for characterizing the strong oracle
property. Moreover, the nonconcave regularized estimator is shown to
achieve asymptotically the information bound of the oracle estimator. A
coordinate-wise algorithm is developed for finding the grid of solution
paths for penalized hazard regression problems, and its performance is
evaluated on simulated and gene association study examples.

1. Introduction. A central theme in high-dimensional data analysis is 
efficient discovery of sparsity patterns. For such data, where dimensionality 
possibly grows exponentially faster than the sample size, sparsity structures 
are imposed as means of recovering important signals. Under the linear 
regression model framework, various methods ranging from regularized to
marginal regressions and graphical models have been effectively proposed
for identification, reconstruction and estimation of the unknown sparse
regression parameters.

With increasing understanding of sparse recovery in these novel
high-dimensional spaces, more and more attention is paid to efficient
discovery of sparsity patterns for ultra-high dimensional data, and great
progress has been made in the least squares setting. For example,
Meinshausen and Bühlmann (2006), Zhao and Yu (2006) and Zhang and Huang
(2008) investigated model selection consistency of LASSO when the number
of variables is of a greater order than the sample size; Candes and Tao
(2007) introduced the Dantzig selector specifically to handle the
NP-dimensional variable selection problem; and Bunea, Tsybakov and
Wegkamp (2007), Bickel, Ritov and Tsybakov (2009), van de Geer and
Bühlmann (2009), Koltchinskii (2009), Meinshausen and Yu (2009), Massart
and Meynet (2010), among others, showed their asymptotic or finite sample
oracle risk properties for fixed or random ill-posed designs. Various
versions of the "restricted eigenvalue condition," "sparse Riesz
condition" or "incoherence condition" that exclude high correlations
among variables play a key role here. On the other hand, when some of
these conditions fail, the LASSO estimator often selects a model which is
overly dense in its effort to relax the penalty on the relevant
coefficients [Fan and Li (2001), Zhang (2010), Zhang and Huang (2008)].
Hence, nonconvex penalties [Fan and Li (2001)] were proposed; Zhang
(2010) pioneered the work with NP-dimensionality and demonstrated sign
consistency for $p \gg n$ as well as advantages over LASSO in the sense
of attaining minimax convergence rates. Lv and Fan (2009) and Fan and Lv
(2011) made important connections between finite sample and asymptotic
oracle properties using folded-concave penalties for the penalized least
squares estimator with NP-dimensionality. Although extensive work has
been done for linear regression models, censored survival data have been
left largely unexplored for $p \gg n$.

Extending oracle results to censored data with NP-dimensionality presents
a substantial new challenge, and, to the best of our knowledge, there is
no previous work on this topic. The extensions of the LASSO and SCAD
algorithms to survival data were successfully proposed by Tibshirani
(1997) and Fan and Li (2002), respectively, but both algorithms were
theoretically tested only when $p < n$. In recent papers, Johnson (2009),
Wang et al. (2009) and Du, Ma and Liang (2010) addressed the problem in
accelerated failure time models, Cox's model and semiparametric relative
risk models by combining the LASSO, group LASSO and adaptive LASSO
penalties, but, likewise, they only discussed the case of $p \ll n$.

Motivated by the growing importance of gene selection problems, in this
paper we go one step further and address the problem of existence of an
oracle estimator and regularization estimator under an ultra-high
dimensionality setting, where the full dimensionality might grow
exponentially or nonpolynomially fast with the sample size, of order
$\log p = O(n^{\delta})$ for some $\delta > 0$, and the intrinsic
dimensionality goes to infinity, of order $s = O(n^{\alpha})$ for
$\alpha \in (0, 1)$. We develop a strong oracle argument, which shares
the spirit of Fan and Li (2002), but guarantees that the folded-concave
penalized partial likelihood estimator is equal to the oracle one, with
probability tending to 1. A similar strong oracle argument was developed
by Kim, Choi and Oh (2008) and Bradic, Fan and Wang (2011) in the context
of linear regression models. Extending such results to Cox's proportional
hazards model is a new and demanding challenge because of censoring and
NP-dimensionality.

1.1. Model setup. We consider multivariate data
$\{(\mathbf{X}_i, T_i)\}_{i=1}^n$, which form an i.i.d. sample from the
population $(\mathbf{X}, T)$, where
$\mathbf{X}_i = (X_{i1}, \ldots, X_{ip})^T$ is a column vector of
covariates for the $i$th individual. For a variety of reasons not all
survival times $(T_i)_{i=1}^n$ are fully observable. The independent
right censoring scheme is considered, where i.i.d. censoring times
$(C_i)_{i=1}^n$ are conditionally independent of survival times given
covariates $\{\mathbf{X}_i\}_{i=1}^n$. Hence, we work with the i.i.d.
sample $\{(\mathbf{X}_i, Z_i, \delta_i)\}_{i=1}^n$, where
$Z_i = \min(T_i, C_i)$ and $\delta_i = 1\{T_i \le C_i\}$ are the event
times and censoring indicators, respectively.

The conditional hazard rate function of $T$ given
$\mathbf{X} = \mathbf{x}$ is denoted by $\lambda(t \mid \mathbf{x})$.
Cox's proportional hazards model assumes that

(1) $\lambda(t \mid \mathbf{X}) = \lambda_0(t)\exp(\boldsymbol{\beta}^T\mathbf{X})$,

where the baseline hazard rate $\lambda_0(t)$ is a nuisance function. Let
$t_1 < \cdots < t_N$ denote the ordered failure times and $(j)$ denote
the label of the item failing at $t_j$. Denote by
$\mathcal{R}_j = \{i \in \{1, \ldots, n\} : Z_i \ge t_j\}$ the risk set
at time $t_j$ and by $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$ the
cumulative baseline hazard function.

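Following the approach of nonparametric maximum likelihood estimation,
the "least informative" nonparametric modeling of $\Lambda_0(t)$ assumes
that $\Lambda_0(t)$ has a jump of size $\theta_j$ at the failure time
$t_j$: $\Lambda_0(t; \theta) = \sum_{j=1}^N \theta_j 1\{t_j \le t\}$. If
we use the Breslow MLE
$\hat{\theta}_j^{-1} = \sum_{i \in \mathcal{R}_j} \exp(\boldsymbol{\beta}^T\mathbf{X}_i)$,
then the penalized Cox's log partial likelihood becomes [Fan and Li
(2002)]

(2) $Q_n(\boldsymbol{\beta}) - n\sum_{k=1}^p p_{\lambda_n}(|\beta_k|)$,

where
$Q_n(\boldsymbol{\beta}) = \sum_{j=1}^N \{\boldsymbol{\beta}^T\mathbf{X}_{(j)} - \log(\sum_{i \in \mathcal{R}_j} \exp(\boldsymbol{\beta}^T\mathbf{X}_i))\}$,
$p_{\lambda_n}(\cdot)$ is a penalty function, and $\lambda_n$ is a
nonnegative regularization parameter. Note that the covariate vector
$\mathbf{X}$ may be time dependent and incorporated in the standard way
in model (1) through

$\lambda(t \mid \mathbf{X}(t)) = \lim_{\Delta t \to 0} P\{t \le T < t + \Delta t \mid T \ge t, \mathbf{X}(t)\}/\Delta t = \lambda_0(t)\exp(\boldsymbol{\beta}^T\mathbf{X}(t))$.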
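
As a concrete illustration of the objective in (2), the following Python sketch evaluates $Q_n(\boldsymbol{\beta})$ and the penalized criterion for time-independent covariates and untied failure times. The function names (`cox_log_partial_likelihood`, `penalized_objective`, `lasso_penalty`) and the use of NumPy are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def cox_log_partial_likelihood(beta, X, Z, delta):
    """Q_n(beta): sum over observed failures j of
    beta^T X_(j) - log(sum_{i in R_j} exp(beta^T X_i)).
    Assumes time-independent covariates and no tied failure times."""
    eta = X @ beta                      # linear predictors beta^T X_i
    total = 0.0
    for j in np.flatnonzero(delta == 1):
        at_risk = Z >= Z[j]             # risk set R_j = {i : Z_i >= t_j}
        total += eta[j] - np.log(np.sum(np.exp(eta[at_risk])))
    return total

def lasso_penalty(t, lam):
    """L1 penalty p_lambda(t) = lambda * t for t >= 0."""
    return lam * t

def penalized_objective(beta, X, Z, delta, penalty, lam):
    """Criterion (2): Q_n(beta) - n * sum_k p_lambda(|beta_k|)."""
    n = X.shape[0]
    return (cox_log_partial_likelihood(beta, X, Z, delta)
            - n * np.sum(penalty(np.abs(beta), lam)))
```

Any penalty with the same call signature (for example a SCAD implementation) can be passed in place of `lasso_penalty`.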

Note that from here on we will be working with the time-dependent left
continuous covariate vector $\mathbf{X}(t)$.

1.2. Counting process representation. Let
$N_i(t) = 1\{Z_i \le t, \delta_i = 1\}$,
$N(t) = \sum_{i=1}^n N_i(t)$ and $Y_i(t) = 1\{Z_i \ge t\}$. Note that the
process $\mathbf{Y}(t) = (Y_1(t), \ldots, Y_n(t))$ is assumed to be left
continuous with right-hand limits and satisfies
$P(Y(t) = 1, 0 \le t \le \tau) > 0$. Using the counting process notation,
one can rewrite the log partial likelihood $Q_n(\boldsymbol{\beta})$ for
model (2) as

$Q_n(\boldsymbol{\beta}) = \sum_{i=1}^n \int_0^\tau \{\boldsymbol{\beta}^T\mathbf{X}_i(t) - \log(S_n^{(0)}(\boldsymbol{\beta}, t))\}\,dN_i(t)$,

where and hereafter $\tau$ is the study ending time, and

$S_n^{(\ell)}(\boldsymbol{\beta}, t) = n^{-1}\sum_{i=1}^n Y_i(t)\{\mathbf{X}_i(t)\}^{\otimes \ell}\exp(\boldsymbol{\beta}^T\mathbf{X}_i(t))$, $\ell = 0, 1, 2$,

with $\otimes$ denoting the outer product. Thus, the penalized log
partial likelihood becomes

(3) $\mathcal{C}(\boldsymbol{\beta}, \tau) = \sum_{i=1}^n \int_0^\tau \{\boldsymbol{\beta}^T\mathbf{X}_i(t) - \log(S_n^{(0)}(\boldsymbol{\beta}, t))\}\,dN_i(t) - n\sum_{j=1}^p p_{\lambda_n}(|\beta_j|)$.

Define the sparse estimator $\hat{\boldsymbol{\beta}}$ as the maximizer
of $\mathcal{C}(\boldsymbol{\beta}, \tau)$ over
$\boldsymbol{\beta} \in \Omega_p$, where $\Omega_p$ is the parameter
space, which is a compact subset of $\mathbb{R}^p$ and contains the true
value of $\boldsymbol{\beta}$. Note that $N_i(t)$ is a counting process
with intensity process
$\lambda_i(t, \boldsymbol{\beta}) = \lambda_0(t)Y_i(t)\exp\{\boldsymbol{\beta}^T\mathbf{X}_i(t)\}$,
which does not admit jumps at the same time as $N_j(t)$ for $j \ne i$.
Denote by $\boldsymbol{\beta}^*$ the true value of $\boldsymbol{\beta}$
and $\Lambda_i(t) = \int_0^t \lambda_i(u, \boldsymbol{\beta}^*)\,du$.
Then $M_i(t) = N_i(t) - \Lambda_i(t)$ is an orthogonal local square
integrable martingale with respect to the filtration

$\mathcal{F}_{t,i} = \sigma\{N_i(u), \mathbf{X}_i(u^+), Y_i(u^+), 0 \le u \le t\}$;

that is, $\langle M_i(t), M_j(t)\rangle = 0$ for $i \ne j$. Let
$\mathcal{F}_t = \bigcup_{i=1}^n \mathcal{F}_{t,i}$ be the smallest
$\sigma$-algebra containing $\mathcal{F}_{t,i}$. Then
$M(t) = \sum_{i=1}^n M_i(t)$ is a martingale with respect to
$\mathcal{F}_t$.



1.3. Choice of the penalty function. There are many commonly used
penalties in the literature, for example, the $L_2$ penalty used in ridge
regression; the nonnegative garrote as a shrinkage estimation [Yuan and
Lin (2007)]; the $L_0$ penalty for the best subset selection; the $L_1$
penalty LASSO [Tibshirani (1996)] as a convex relaxation of the $L_0$
penalty; the SCAD penalty [Fan and Li (2001)], defined via its derivative
$p'_\lambda(t) = \lambda\{I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a - 1)\lambda} I(t > \lambda)\}$,
$t > 0$, for some $a > 2$, as a folded-concave relaxation of the $L_0$
penalty; and the MCP penalty [Zhang (2010)]. Recently, a class of
penalties bridging the $L_0$ and $L_1$ penalties was introduced in Lv and
Fan (2009). All of these penalties are folded concave penalties, as noted
in Fan and Li (2001) and Fan and Lv (2011). As a collection of nonconvex
relaxations of the $L_0$ penalty, they serve as a tool for allowing
bigger correlations among covariates (see Condition 8) and hence relax
significantly the standard "incoherence condition" and control the tail
bias of the resulting penalized estimator (see Theorem 4.2). For any
penalty function $p_{\lambda_n}(\cdot)$, let
$\rho(t; \lambda_n) = \lambda_n^{-1}p_{\lambda_n}(t)$ and write
$\rho(t; \lambda_n)$ as $\rho(t)$ for simplicity when there is no
confusion. According to Fan and Lv (2011), the folded concave penalties
are defined through the following Condition 1.

Condition 1. $\rho(t; \lambda_n)$ is increasing and concave in
$t \in [0, \infty)$ and has a continuous derivative
$\rho'(t; \lambda_n)$ with $\rho'(0+; \lambda_n) > 0$. In addition,
$\rho'(t; \lambda_n)$ is increasing in $\lambda_n \in (0, \infty)$ and
$\rho'(0+; \lambda_n)$ is independent of $\lambda_n$.

Note that most commonly used nonconvex penalties, including SCAD and MCP
($a > 1$), satisfy Condition 1. We will employ the folded concave
penalties to increase flexibility of our method. The LASSO penalty, as a
convex function, falls at the boundary of the penalties in Condition 1,
and our results will be applicable to the LASSO penalty as well.
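
The SCAD derivative quoted above can be coded directly. The sketch below is illustrative only: the function names are hypothetical, and the default $a = 3.7$ is the value commonly attributed to Fan and Li (2001), used here as an assumption rather than something prescribed in this section.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """p'_lambda(t) = lam * { I(t <= lam) + (a*lam - t)_+ / ((a-1)*lam) * I(t > lam) }
    for t >= 0 and a > 2."""
    t = np.asarray(t, dtype=float)
    tail = np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam)
    return lam * ((t <= lam) + tail * (t > lam))

def scad_rho_prime(t, lam, a=3.7):
    """rho'(t; lam) = p'_lambda(t) / lam, the rescaled derivative used in Condition 1."""
    return scad_derivative(t, lam, a) / lam
```

With this rescaling, `scad_rho_prime` equals 1 near the origin for every value of `lam`, is nonincreasing in `t`, and vanishes beyond `a * lam`, in line with the requirements of Condition 1.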

The rest of the paper is organized as follows. In Section 2 we deal with
the identification problem of the penalized estimator
$\hat{\boldsymbol{\beta}}$, which is key to the proof of the oracle
results. A large deviation result is derived for the divergence of
a martingale from its compensator in Section 3. In Section 4 we work
out the new strong oracle property and its implications for LASSO and
SCAD, together with asymptotic properties of the proposed estimator. In
Section 5 we propose an iterative coordinate ascent algorithm (ICA) and
examine a thorough simulation example; see Section 5.1. The gene
association study is done in Section 5.2, where the non-Hodgkin's
lymphoma dataset of Dave et al. (2004) is analyzed. Technical lemmas and
proofs are collected in the Appendix and in the supplementary material
[Bradic, Fan and Jiang (2011)].

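2. Identification. This section gives the appropriate necessary and
sufficient conditions on the existence of the estimator
$\hat{\boldsymbol{\beta}}$. We can always assume that the true parameter
$\boldsymbol{\beta}^*$ can be arranged as
$\boldsymbol{\beta}^* = (\boldsymbol{\beta}_1^{*T}, \mathbf{0}^T)^T$,
with $\boldsymbol{\beta}_1^* \in \Omega_s$ being a vector of nonvanishing
elements of $\boldsymbol{\beta}^*$, where
$\Omega_s = \Omega_p \cap \mathbb{R}^s$.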

Throughout the paper the following notation on a vector/matrix norm is
used. Denote by $\lambda_{\min}(B)$ and $\lambda_{\max}(B)$ the minimum
and maximum eigenvalues of a symmetric matrix $B$, respectively. We also
use $\lambda(B)$ to denote any eigenvalue of $B$. Let $\|\cdot\|_q$ be
the $L_q$ norm of a vector or matrix. Then for an $s \times s$ matrix
$A$, $\|A\|_\infty = \max\{\sum_{k=1}^s |(A)_{jk}| : 1 \le j \le s\}$ and
$\|A\|_2 = \{\lambda_{\max}(A^TA)\}^{1/2}$. We also let $\sigma(A)$ be
the set consisting of all eigenvalues of $A$, and let
$r_\sigma(A) = \max\{|\lambda| : \lambda \in \sigma(A)\}$ be the spectral
radius of $A$. If $A$ is symmetric, then $r_\sigma(A) = \|A\|_2$.

Since no concavity is assumed for the penalized log partial likelihood
(3), it is difficult, in general, to study the global maximizer of the
penalized likelihood. One useful index controlling the convexity of the
whole optimization problem (3) is the following "local concavity" of the
penalty function $\rho(\cdot)$ at
$\mathbf{v} = (v_1, \ldots, v_s)^T \in \mathbb{R}^s$ with
$\|\mathbf{v}\|_0 = s$,

(4) $\kappa(\rho, \mathbf{v}) = \lim_{\varepsilon \to 0+} \max_{1 \le j \le s} \sup_{t_1 < t_2 \in (|v_j| - \varepsilon, |v_j| + \varepsilon)} -\frac{\rho'(t_2) - \rho'(t_1)}{t_2 - t_1}$,

which is defined in Lv and Fan (2009) and shares a similar spirit with
the "maximum concavity" of $\rho$ in Zhang (2010). Since $\rho$ is
concave on $(0, \infty)$, $\kappa(\rho, \mathbf{v}) \ge 0$. For the LASSO
penalty $\kappa(\rho, \mathbf{v}) = 0$, whereas for the SCAD penalty

$\kappa(\rho, \mathbf{v}) = (a - 1)^{-1}\lambda^{-1}$, if there exists a $v_j$ such that $\lambda \le |v_j| \le a\lambda$;
$\kappa(\rho, \mathbf{v}) = 0$, otherwise.
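
The closed form for the SCAD local concavity can be evaluated with a few lines of Python. This is only a sketch of the two-case formula stated above; the function name `scad_local_concavity` and the default $a = 3.7$ are assumptions for illustration.

```python
import numpy as np

def scad_local_concavity(v, lam, a=3.7):
    """kappa(rho, v) for SCAD: 1 / ((a - 1) * lam) if some |v_j| lies in
    [lam, a * lam], and 0 otherwise. All entries of v are assumed nonzero."""
    abs_v = np.abs(np.asarray(v, dtype=float))
    in_concave_band = np.any((abs_v >= lam) & (abs_v <= a * lam))
    return 1.0 / ((a - 1.0) * lam) if in_concave_band else 0.0
```

When every coefficient is either smaller than `lam` or larger than `a * lam` in absolute value, the function returns zero, which is the situation in which the SCAD penalty behaves locally like a linear or constant function.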



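Let $\hat{\boldsymbol{\beta}}_1$ be a subvector of
$\hat{\boldsymbol{\beta}}$ formed by all nonzero components and
$\hat{s} = \dim(\hat{\boldsymbol{\beta}}_1)$. Denote by $\mathbf{S}_i$
the subvector of $\mathbf{X}_i$ with the same indexes as
$\hat{\boldsymbol{\beta}}_1$ in $\hat{\boldsymbol{\beta}}$ and by
$\mathbf{Q}_i$ the complement to $\mathbf{S}_i$. For
$\mathbf{v} = (v_1, \ldots, v_s)^T \in \mathbb{R}^s$, let
$\rho'(\mathbf{v}) = (\rho'(v_1), \ldots, \rho'(v_s))^T$ and
$\mathrm{sgn}(\mathbf{v}) = (\mathrm{sgn}(v_1), \ldots, \mathrm{sgn}(v_s))^T$.
Partition
$S_n^{(1)}(\boldsymbol{\beta}, t) = [S_{n1}^{(1)}(\boldsymbol{\beta}, t)^T, S_{n2}^{(1)}(\boldsymbol{\beta}, t)^T]^T$
and

$S_n^{(2)}(\boldsymbol{\beta}, t) = \begin{pmatrix} S_{n11}^{(2)}(\boldsymbol{\beta}, t) & S_{n12}^{(2)}(\boldsymbol{\beta}, t) \\ S_{n21}^{(2)}(\boldsymbol{\beta}, t) & S_{n22}^{(2)}(\boldsymbol{\beta}, t) \end{pmatrix}$

according to the partition of
$\boldsymbol{\beta} = (\boldsymbol{\beta}_1^T, \boldsymbol{\beta}_2^T)^T$,
so that $S_{n1}^{(1)}(\boldsymbol{\beta}, t)$ is a
$\dim(\boldsymbol{\beta}_1) \times 1$ vector and
$S_{n11}^{(2)}(\boldsymbol{\beta}, t)$ is a
$\dim(\boldsymbol{\beta}_1) \times \dim(\boldsymbol{\beta}_1)$ matrix. Let
$E_{n1}(\boldsymbol{\beta}, t) = S_{n1}^{(1)}(\boldsymbol{\beta}, t)/S_n^{(0)}(\boldsymbol{\beta}, t)$,
$E_{n2}(\boldsymbol{\beta}, t) = S_{n2}^{(1)}(\boldsymbol{\beta}, t)/S_n^{(0)}(\boldsymbol{\beta}, t)$,
$E_n(\boldsymbol{\beta}, t) = S_n^{(1)}(\boldsymbol{\beta}, t)/S_n^{(0)}(\boldsymbol{\beta}, t)$,
$V(\boldsymbol{\beta}, t) = S_{n11}^{(2)}(\boldsymbol{\beta}, t)/S_n^{(0)}(\boldsymbol{\beta}, t) - (S_{n1}^{(1)}(\boldsymbol{\beta}, t)/S_n^{(0)}(\boldsymbol{\beta}, t))^{\otimes 2}$
and
$V(\boldsymbol{\beta}_1, t) = V((\boldsymbol{\beta}_1, \mathbf{0}), t)$.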
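
The risk-set averages and the derived mean and variance quantities can be computed as follows. This is a minimal sketch for time-independent covariates; the function names `risk_set_moments` and `weighted_mean_and_variance` are hypothetical, and the code works with the full covariate vector rather than the partitioned blocks.

```python
import numpy as np

def risk_set_moments(beta, X, Z, t):
    """S_n^{(l)}(beta, t) = n^{-1} sum_i Y_i(t) X_i^{(outer) l} exp(beta^T X_i), l = 0, 1, 2,
    with Y_i(t) = 1{Z_i >= t}."""
    n = X.shape[0]
    w = (Z >= t) * np.exp(X @ beta)     # Y_i(t) * exp(beta^T X_i)
    s0 = w.sum() / n                    # scalar
    s1 = X.T @ w / n                    # vector of length p
    s2 = (X.T * w) @ X / n              # p x p matrix
    return s0, s1, s2

def weighted_mean_and_variance(beta, X, Z, t):
    """E_n = S^{(1)} / S^{(0)} and V = S^{(2)} / S^{(0)} - E_n E_n^T."""
    s0, s1, s2 = risk_set_moments(beta, X, Z, t)
    e = s1 / s0
    v = s2 / s0 - np.outer(e, e)
    return e, v
```

Restricting `X` to the columns of the active set before calling these functions gives the sub-blocks indexed by the nonzero coefficients.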

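The following theorem provides a sufficient condition on the strict
local maximizer of $\mathcal{C}(\boldsymbol{\beta}, \tau)$. The proof is
relegated to the supplementary material [Bradic, Fan and Jiang (2011)].

Theorem 2.1. If Condition 1 is satisfied, then an estimate
$\hat{\boldsymbol{\beta}} \in \mathbb{R}^p$ is a strict local maximizer
of the nonconcave penalized log partial likelihood (3) if

(5) $\sum_{i=1}^n \int_0^\tau (\mathbf{S}_i(t) - E_{n1}(\hat{\boldsymbol{\beta}}, t))\,dN_i(t) - n\lambda_n\rho'(|\hat{\boldsymbol{\beta}}_1|) \circ \mathrm{sgn}(\hat{\boldsymbol{\beta}}_1) = \mathbf{0}$,

(6) $\|\mathbf{z}(\hat{\boldsymbol{\beta}})\|_\infty = \|\sum_{i=1}^n \int_0^\tau (\mathbf{Q}_i(t) - E_{n2}(\hat{\boldsymbol{\beta}}, t))\,dN_i(t)\|_\infty < n\lambda_n\rho'(0+)$,

(7) $\lambda_{\min}\{\int_0^\tau V(\hat{\boldsymbol{\beta}}_1, t)\,dN(t)\} > n\lambda_n\kappa(\rho, \hat{\boldsymbol{\beta}}_1)$,

where $\circ$ is the Hadamard product. Conversely, if
$\hat{\boldsymbol{\beta}}$ is a local maximizer of
$\mathcal{C}(\boldsymbol{\beta}, \tau)$, then it must satisfy (5)-(7)
with strict inequalities replaced by nonstrict inequalities.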

When the LASSO penalty is used, $\kappa(\rho, \mathbf{v}) = 0$, hence the
condition of nonsingularity for the matrix in (7) is automatically
satisfied with a nonstrict inequality. For the SCAD penalty,
$\kappa(\rho, \hat{\boldsymbol{\beta}}_1) = 0$; that is, (7) holds with
nonstrict inequality, unless there are some $j$ such that
$\lambda_n \le |\hat{\beta}_j| \le a\lambda_n$, which usually has a small
chance. In the latter case,
$\kappa(\rho, \hat{\boldsymbol{\beta}}_1) = (a - 1)^{-1}\lambda_n^{-1}$,
and the condition in (7) reduces to

$\lambda_{\min}\{n^{-1}\int_0^\tau V(\hat{\boldsymbol{\beta}}_1, t)\,dN(t)\} > 1/(a - 1)$.

This will hold if a large $a$ is used, due to nonsingularity of the
matrix.
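
Conditions (5) and (6) can be checked numerically for a candidate estimate. The sketch below assumes time-independent covariates and untied failure times; the names `partial_likelihood_score`, `check_first_order_conditions` and `lasso_rho_prime`, as well as the tolerance `tol`, are illustrative assumptions and not part of the paper.

```python
import numpy as np

def partial_likelihood_score(beta, X, Z, delta):
    """sum_i int_0^tau (X_i - E_n(beta, t)) dN_i(t): one term per observed failure."""
    risk = np.exp(X @ beta)
    u = np.zeros(X.shape[1])
    for j in np.flatnonzero(delta == 1):
        w = (Z >= Z[j]) * risk              # weights on the risk set at time Z_j
        u += X[j] - (X.T @ w) / w.sum()     # X_(j) - E_n(beta, Z_j)
    return u

def lasso_rho_prime(t, lam):
    """rho'(t; lam) = 1 for the LASSO penalty."""
    return np.ones_like(np.asarray(t, dtype=float))

def check_first_order_conditions(beta_hat, X, Z, delta, lam, rho_prime, tol=1e-6):
    """Return (cond5, cond6): stationarity on the active set up to tol, and the
    strict sup-norm bound n * lam * rho'(0+) on the inactive set."""
    n = X.shape[0]
    u = partial_likelihood_score(beta_hat, X, Z, delta)
    active = beta_hat != 0
    b1 = beta_hat[active]
    residual = u[active] - n * lam * rho_prime(np.abs(b1), lam) * np.sign(b1)
    cond5 = bool(np.all(np.abs(residual) <= tol))
    cond6 = bool(np.all(np.abs(u[~active]) < n * lam * rho_prime(0.0, lam)))
    return cond5, cond6
```

For example, the all-zero vector passes both checks exactly when the sup-norm of the score at zero is below `n * lam * rho_prime(0.0, lam)`, which is how a smallest fully sparse value of the regularization parameter could be located in practice.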

It is natural to ask if the penalized nonconcave Cox's log partial
likelihood has a global maximizer. Since $p \gg n$, it is hard to show
the global optimality of a local maximizer. Theorem 4.1 in Section 4
suggests a condition for the maximizer to be unique and global. Once the
unique maximizer is available, it will be equal to the oracle one with
probability tending to one exponentially fast, when the effective
dimensionality $s$ is bounded by $O(n^{\alpha})$ for $\alpha < 1$ (see
Theorem 4.3). In this way Theorems 2.1, 4.1 and 4.3 address uniqueness
of the solution and provide methods for finding the global maximizer
among potentially many. The methodological innovations include, among
others, using equations (5) and (6) as an identification tool to
circumvent the absence of an analytical form of the estimator
$\hat{\boldsymbol{\beta}}$.

3. A large deviation result. In view of (5) and (6), to study a
nonconcave penalized Cox's partial likelihood estimator
$\hat{\boldsymbol{\beta}}$, we need to analyze the deviation of the
$p$-dimensional counting process
$\int_0^t \{\mathbf{X}_i(u) - E_n(\boldsymbol{\beta}^*, u)\}\,dN_i(u)$
from its compensator
$A_i = \int_0^t \{\mathbf{X}_i(u) - E_n(\boldsymbol{\beta}^*, u)\}\,d\Lambda_i(u)$.
In other words, we need to simultaneously analyze the deviation of
marginal score vectors from their compensators. Some conditions are
needed for this purpose.

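Condition 2. There exists a compact neighborhood $\mathcal{B}$ of
$\boldsymbol{\beta}^*$ that satisfies each of the following conditions:

(i) There exist scalar, vector and matrix functions $s^{(j)}$ defined on
$\mathcal{B} \times [0, \tau]$ such that, in probability as
$n \to \infty$ for $j = 0, 1, 2$,
$\sup_{t \in [0, \tau], \boldsymbol{\beta}_1 \in \mathcal{B}_1} \|S_n^{(j)}(\boldsymbol{\beta}_1, t) - s^{(j)}(\boldsymbol{\beta}_1, t)\|_2 \to 0$,
for $\mathcal{B}_1 \subset \mathbb{R}^s$, $\mathcal{B}_1 \subset \mathcal{B}$.

(ii) The functions $s^{(j)}$ are bounded and $s^{(0)}$ is bounded away
from 0 on $\mathcal{B} \times [0, \tau]$; for $j = 0, 1, 2$, the family
of functions $s^{(j)}(\cdot, t)$, $0 \le t \le \tau$, is an
equicontinuous family at $\boldsymbol{\beta}^*$.

(iii) Let
$e(\boldsymbol{\beta}, t) = s^{(1)}(\boldsymbol{\beta}, t)/s^{(0)}(\boldsymbol{\beta}, t)$,
$v(\boldsymbol{\beta}, t) = s^{(2)}(\boldsymbol{\beta}, t)/s^{(0)}(\boldsymbol{\beta}, t) - \{e(\boldsymbol{\beta}, t)\}^{\otimes 2}$
and
$\Sigma_{\boldsymbol{\beta}}(t) = \int_0^t v(\boldsymbol{\beta}, u)s^{(0)}(\boldsymbol{\beta}^*, u)\,d\Lambda_0(u)$.
Define $v(\boldsymbol{\beta}_1, t)$ in the same way as for
$V(\boldsymbol{\beta}_1, t)$ but with $S_n^{(\ell)}$ replaced by
$s^{(\ell)}$. Let

$\Sigma_{\boldsymbol{\beta}_1}(t) = \int_0^t v(\boldsymbol{\beta}_1, u)s^{(0)}(\boldsymbol{\beta}_1^*, u)\,d\Lambda_0(u)$

and $\Sigma_{\boldsymbol{\beta}_1} = \Sigma_{\boldsymbol{\beta}_1}(\tau)$.
Assume that the $s \times s$ matrix $\Sigma_{\boldsymbol{\beta}_1^*}$ is
positive definite for all $n$ and $\Lambda_0(\tau) < \infty$.

(iv) Let
$c_n = \sup_{t \in [0, \tau]} \|E_n(\boldsymbol{\beta}^*, t) - e(\boldsymbol{\beta}^*, t)\|_\infty$
and
$d_n = \sup_{t \in [0, \tau]} |S_n^{(0)}(\boldsymbol{\beta}^*, t) - s^{(0)}(\boldsymbol{\beta}^*, t)|$.
The random sequences $c_n$ and $d_n$ are bounded almost surely.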

The above conditions, (i)-(iii), agree with the conditions in Section 8.2
of Fleming and Harrington (1991) for fixed $p$ and in Cai et al. (2005)
for diverging $p$. Condition (iii) is restricted to hold on the
$s$-dimensional subspace instead of the usually assumed $p$-dimensional
one. This is a counterpart of the similar conditions imposed on the
covariance matrix $\Sigma$ in the linear regression models [see, e.g.,
Bunea, Tsybakov and Wegkamp (2007), van de Geer and Bühlmann (2009),
Zhang (2010)]. Nonsingularity of the matrix
$\Sigma_{\boldsymbol{\beta}_1^*}$ in (iii) could have been relaxed toward
restricted eigenvalue properties like those for linear models [Bickel,
Ritov and Tsybakov (2009), Koltchinskii (2009)], but for ease of
exposition we impose a slightly stronger condition. Condition (iv) is
used to ensure that the score vector of the log partial likelihood, which
is a martingale, has bounded jumps and quadratic variation. By following
the discussion on pages 305 and 306 of Fleming and Harrington (1991),
this condition is not stringent for i.i.d. samples.

The following Condition 3 arises from the martingale representation of
the score function for the Cox model, and it is valuable in analyzing
large deviations of counting processes.

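Condition 3. Let
$\varepsilon_{ij} = \int_0^\tau (X_{ij}(t) - e_j(\boldsymbol{\beta}^*, t))\,dM_i(t)$,
where $e_j(\boldsymbol{\beta}^*, t)$ is the $j$th component of
$e(\boldsymbol{\beta}^*, t)$. Suppose the Cramér condition holds for
$\varepsilon_{ij}$, that is,

$E|\varepsilon_{ij}|^m \le m!\,M^{m-2}\sigma_j^2/2$

for all $j$, where $M$ is a positive constant, $m \ge 2$ and
$\sigma_j^2 = \mathrm{var}(\varepsilon_{ij}) < \infty$.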

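In linear regression models, the large deviation is established upon the
Cramér condition for the covariates. Condition 3 takes a similar role
here and can be regarded as an extension of the classical Cramér
condition. Moreover, it is trivially fulfilled if the covariates are
bounded. In that sense it represents a relaxation of the typical
assumption of bounded covariates. Since

$\sigma_j^2 = E\int_0^\tau (X_{ij}(t) - e_j(\boldsymbol{\beta}^*, t))^2 Y_i(t)\exp(\boldsymbol{\beta}^{*T}\mathbf{X}_i(t))\,d\Lambda_0(t)$,

$\sigma_j^2$ is the $j$th diagonal entry of
$\Sigma_{\boldsymbol{\beta}^*}$. Define
$\boldsymbol{\xi} = (\xi_1, \ldots, \xi_p)^T$ to be the score vector of
the log partial likelihood function $Q_n(\boldsymbol{\beta})$,

$\boldsymbol{\xi} = \sum_{i=1}^n \int_0^\tau \{\mathbf{X}_i(t) - E_n(\boldsymbol{\beta}^*, t)\}\,dN_i(t)$.

Since $M_i(t) = N_i(t) - \Lambda_i(t)$ is a martingale with compensator
$\Lambda_i(t) = \int_0^t \lambda_i(u, \boldsymbol{\beta}^*)\,du$, we can
rewrite $\xi_j$ as
$\sum_{i=1}^n \int_0^\tau \{X_{ij}(t) - E_{nj}(\boldsymbol{\beta}^*, t)\}(dM_i(t) + d\Lambda_i(t))$,
where $E_{nj}(\boldsymbol{\beta}^*, t)$ is the $j$th component of
$E_n(\boldsymbol{\beta}^*, t)$. Note that
$\sum_{i=1}^n \int_0^\tau \{X_{ij}(t) - E_{nj}(\boldsymbol{\beta}^*, t)\}\,d\Lambda_i(t) = 0$,
leading to the representation of the form

$\xi_j = \sum_{i=1}^n \int_0^\tau \{X_{ij}(t) - E_{nj}(\boldsymbol{\beta}^*, t)\}\,dM_i(t)$.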

The following theorem characterizes the uniform deviation of the score
vector $\boldsymbol{\xi}$ and is critical in obtaining the strong oracle
property; see Theorem 4.3 in Section 4. To the best of our knowledge
there is no similar result in the literature.

Theorem 3.1. Under Conditions 2 and 3, for any positive sequence
$\{u_n\}$ bounded away from zero there exist positive constants $c_0$ and
$c_1$ such that

(8) $P(|\xi_j| > \sqrt{n}u_n) \le c_0\exp(-c_1u_n)$

uniformly over $j$, if $v_n = \max_j \sigma_j^2/u_n$ is bounded.

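Proof. Denote by $E_{nj}(\boldsymbol{\beta}^*, t)$ and
$e_j(\boldsymbol{\beta}^*, t)$ the $j$th components of
$E_n(\boldsymbol{\beta}^*, t)$ and $e(\boldsymbol{\beta}^*, t)$,
respectively. Then $\xi_j$ can be written as

$\xi_j = \sum_{i=1}^n \int_0^\tau \{X_{ij}(t) - e_j(\boldsymbol{\beta}^*, t)\}\,dM_i(t) - \int_0^\tau \{E_{nj}(\boldsymbol{\beta}^*, t) - e_j(\boldsymbol{\beta}^*, t)\}\,dM(t) \equiv \xi_{j1}(\tau) - \xi_{j2}(\tau)$.

To establish the exponential inequality for $\xi_j$, in the following we
will establish exponential inequalities for $\xi_{j1}(\tau)$ and
$\xi_{j2}(\tau)$.

Note that $\xi_{j1}(\tau) = \sum_{i=1}^n \varepsilon_{ij}$, where
$\{\varepsilon_{ij}\}_{i=1}^n$ is a sequence of i.i.d. random variables
with mean zero and satisfying Condition 3. It follows from the Bernstein
exponential inequality that

(9) $P(|\xi_{j1}| > a) \le 2\exp\{-a^2/[2(n\sigma_j^2 + Ma)]\}$.

Note that $M(t)$ is a martingale with respect to $\mathcal{F}_t$; it
follows that $\xi_{j2}(t)$ is also a martingale with respect to
$\mathcal{F}_t$. Let $N(t) = \sum_{i=1}^n N_i(t)$. Then
$\Delta N(t) = \sum_{i=1}^n \Delta N_i(t)$, where and hereafter
$\Delta N_i(t) = N_i(t) - N_i(t^-)$ denotes the jump of $N_i(\cdot)$ at
time $t$. Since no two counting processes $N_i$ jump at the same time, we
have $|\Delta N(t)| \le 1$. Let $\Lambda(t) = \sum_{i=1}^n \Lambda_i(t)$.
By continuity of the compensator
$\Lambda_i(t) = \int_0^t \lambda_i(u, \boldsymbol{\beta}^*)\,du$,
$|\Delta\Lambda(t)| = 0$. Since $M(t) = N(t) - \Lambda(t)$,
$|\Delta M(t)| = |\Delta N(t)| \le 1$. Note that $\mathbf{Y}(t)$ and
$\mathbf{X}(t)$ are left continuous in $t$. It is easy to see that

$|\Delta(n^{-1/2}\xi_{j2}(t))| \le n^{-1/2}|E_{nj}(\boldsymbol{\beta}^*, t) - e_j(\boldsymbol{\beta}^*, t)| \le n^{-1/2}\sup_{t \in [0, \tau]} \|E_n(\boldsymbol{\beta}^*, t) - e(\boldsymbol{\beta}^*, t)\|_\infty = n^{-1/2}c_n$,

which is bounded almost surely by Condition 2(iv). Note that the
predictable quadratic variation of $n^{-1/2}\xi_{j2}(t)$, denoted by
$\langle n^{-1/2}\xi_{j2}(t)\rangle$, is bilinear and satisfies

$\langle n^{-1/2}\xi_{j2}(t)\rangle = n^{-1}\int_0^t \{E_{nj}(\boldsymbol{\beta}^*, u) - e_j(\boldsymbol{\beta}^*, u)\}^2\,d\langle M(u)\rangle$
$= \int_0^t \{E_{nj}(\boldsymbol{\beta}^*, u) - e_j(\boldsymbol{\beta}^*, u)\}^2 S_n^{(0)}(\boldsymbol{\beta}^*, u)\,d\Lambda_0(u)$
$\le \int_0^t \|E_n(\boldsymbol{\beta}^*, u) - e(\boldsymbol{\beta}^*, u)\|_\infty^2 S_n^{(0)}(\boldsymbol{\beta}^*, u)\,d\Lambda_0(u) \equiv b_n^2(t)$.

Obviously,
$b_n^2(t) \le b_n^2(\tau) \le c_n^2\int_0^\tau S_n^{(0)}(\boldsymbol{\beta}^*, t)\,d\Lambda_0(t)$.
Note that

$\int_0^\tau S_n^{(0)}(\boldsymbol{\beta}^*, t)\,d\Lambda_0(t) \le \int_0^\tau s^{(0)}(\boldsymbol{\beta}^*, t)\,d\Lambda_0(t) + d_n\Lambda_0(\tau)$.

By Condition 2(ii), (iii) and (iv), there exist constants
$0 < K < \infty$ and $0 < b < \infty$, independent of $j$, such that
$|\Delta(n^{-1/2}\xi_{j2}(t))| \le K$ and
$\langle n^{-1/2}\xi_{j2}(t)\rangle \le b^2$. It follows from the
exponential inequality for martingales with bounded jumps [see Lemma 2.1
of van de Geer (1995)] that, for $u_n > 0$,

$P\{|\xi_{j2}(\tau)| > \sqrt{n}u_n\} = P\{|n^{-1/2}\xi_{j2}(\tau)| > u_n\} \le 2\exp\{-u_n^2/[2(Ku_n + b^2)]\}$.

Therefore, by Condition 2(iv) and because $u_n$ is bounded away from
zero, there exists a constant $c > 0$ such that

(10) $P\{|\xi_{j2}(\tau)| > \sqrt{n}u_n\} \le 2\exp\{-cu_n\}$

uniformly over $j$. Note that

$P\{|\xi_j(\tau)| > \sqrt{n}u_n\} \le P\{|\xi_{j1}(\tau)| > 0.5\sqrt{n}u_n\} + P\{|\xi_{j2}(\tau)| > 0.5\sqrt{n}u_n\}$.

It follows from (9) and (10) that $P\{|\xi_j(\tau)| > \sqrt{n}u_n\}$ is
bounded by

(11) $2\exp\{-\frac{u_n^2}{4(2\sigma_j^2 + Mu_n/\sqrt{n})}\} + 2\exp(-0.5cu_n)$.

Then there exist positive constants $c_0$ and $c_1$ such that
$P\{|\xi_j(\tau)| > \sqrt{n}u_n\} \le c_0\exp(-c_1u_n)$ uniformly over
$j$, if $\max_j \sigma_j^2 = O(u_n)$. □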

Theorem 3.1 represents a uniform, nonasymptotic exponential inequality
for martingales. Compared with other exponential inequalities [de la Peña
(1999), Juditsky and Nemirovski (2011), van de Geer (1995)], it is
uniform over all components $j$. Moreover, its independence of the
dimensionality $p$ proves to be invaluable for NP variable selection.
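
To get a feel for the size of the Bernstein-type bound (9) used in the proof, the right-hand side can be evaluated directly. The helper below is only an illustrative sketch; the function name and argument names are assumptions, and it covers the i.i.d. part $\xi_{j1}$ only, not the full score component.

```python
import math

def bernstein_tail_bound(a, n, sigma2, M):
    """Right-hand side of (9): 2 * exp(-a^2 / (2 * (n * sigma2 + M * a)))
    for a sum of n i.i.d. mean-zero terms with variance sigma2 that satisfy
    the Cramer condition with constant M."""
    return 2.0 * math.exp(-a ** 2 / (2.0 * (n * sigma2 + M * a)))
```

Setting `a = 0.5 * math.sqrt(n) * u_n` reproduces the first term of the bound (11), and the value does not depend on the number of covariates.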

4. Strong oracle property. In this section we will prove a strong oracle
property result, that is, that $\hat{\boldsymbol{\beta}}$ is an oracle
estimator with overwhelming probability, and not merely that it behaves
like an oracle estimator [Fan and Li (2002)]. We assume that the
effective and full dimensionality satisfy $s = O(n^{\alpha})$ and
$\log p = O(n^{\delta})$, for some $\alpha \in (0, 1)$ and $\delta > 0$,
respectively. This notion of strong oracle property requires a
definition of the biased oracle estimator as it was defined in Bradic,
Fan and Wang (2011) for the linear regression problem.

Let us define the biased oracle estimator
$\hat{\boldsymbol{\beta}}^o = (\hat{\boldsymbol{\beta}}_1^{oT}, \mathbf{0}^T)^T$,
where $\hat{\boldsymbol{\beta}}_1^o$ is a solution to the $s$-dimensional
sub-problem

$\arg\max_{\boldsymbol{\beta}_1 \in \Omega_s} \sum_{i=1}^n \int_0^\tau \{\boldsymbol{\beta}_1^T\mathbf{S}_i(t) - \log(S_n^{(0)}((\boldsymbol{\beta}_1, \mathbf{0}), t))\}\,dN_i(t) - n\lambda_n\sum_{j=1}^s \rho(|\beta_j|; \lambda_n)$.

That is,
$\hat{\boldsymbol{\beta}}_1^o = \arg\max\{\mathcal{C}(\boldsymbol{\beta}_1, \tau) : \boldsymbol{\beta}_1 \in \Omega_s\}$
with
$\mathcal{C}(\boldsymbol{\beta}_1, \tau) = \mathcal{C}((\boldsymbol{\beta}_1, \mathbf{0}), \tau)$.
The estimator $\hat{\boldsymbol{\beta}}^o$ is called the biased oracle
estimator, since the oracle knows the true submodel
$\mathcal{M}_* = \{j : \beta_j^* \ne 0\}$, but still applies a penalized
method to estimate the nonvanishing coefficients.

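Theorem 4.1 (Global optimality). Suppose that
$\min_{\boldsymbol{\beta}_1 \in \Omega_s} \lambda_{\min}\{\int_0^\tau V(\boldsymbol{\beta}_1, t)\,dN(t)\} > n\lambda_n\kappa(\rho, \boldsymbol{\beta}_1)$
holds almost surely. Then $\hat{\boldsymbol{\beta}}^o$ is a unique global
maximizer of the penalized log-likelihood
$\mathcal{C}(\boldsymbol{\beta}_1, \tau)$ in $\Omega_s$.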

The above theorem could be relaxed to a minimum over the level sets of
Cox's partial likelihood in a similar manner to Proposition 1 of Fan and
Lv (2011). Its proof is left for the supplementary material [Bradic, Fan
and Jiang (2011)]. For the LASSO penalty, $\hat{\boldsymbol{\beta}}^o$ is
unique and is the global maximizer, since
$\mathcal{C}(\boldsymbol{\beta}_1, \tau)$ is strictly concave. In
general, global maximizers are available for the SCAD and MCP penalties
if one uses a large parameter $a$. In this setting, the biased oracle
estimator is unique as a solution to a strictly concave optimization
problem. Note that it still depends on the penalty function. The biased
oracle estimator, by its definition, satisfies only equation (5) in
Theorem 2.1. Since the nonvanishing components do not need any penalty,
the smaller the penalty the less the bias. In this sense, the biased
oracle estimator with the SCAD penalty has a better performance than the
biased oracle estimator with the LASSO penalty. The former is
asymptotically unbiased [the second term in (5) is zero], while the
latter is not (see Theorems 4.4 and 4.5).

In order to establish asymptotic properties of the estimator, we need to
govern the conditioning number of the $s \times s$ information matrix
$\Sigma_{\boldsymbol{\beta}_1^*}$ through its eigenvalues. This is done
in the following condition.

Condition 4. $r_\sigma(\Sigma_{\boldsymbol{\beta}_1^*}) = O(1)$ and
$r_\sigma(\Sigma_{\boldsymbol{\beta}_1^*}^{-1}) = O(1)$.

Concerning Condition 2, positive definiteness of
$\Sigma_{\boldsymbol{\beta}_1^*}$ is not enough and a further bound on
its spectrum is needed. Condition 4 is in the same spirit as the partial
Riesz condition and is weaker than Condition A3 of Cai et al. (2005),
where Condition 4 holds for $\Sigma_{\boldsymbol{\beta}^*}$. With respect
to Theorem 3.1, Condition A3 of Cai et al. (2005) ensures that
$\max_j \sigma_j^2$ is bounded, therefore satisfying
$\max_j \sigma_j^2 = O(u_n)$ for any positive sequence $u_n$ bounded away
from zero.

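The following lemma controls the difference between the empirical
information matrix

$\mathcal{I}_{\boldsymbol{\beta}_1} = \int_0^\tau V(\boldsymbol{\beta}_1, t)S_n^{(0)}(\boldsymbol{\beta}_1^*, t)\lambda_0(t)\,dt$

and its population counterpart $\Sigma_{\boldsymbol{\beta}_1}$, and
plays a crucial part in the theoretical developments of this section.

Lemma 4.1. Assume that Conditions 2 and 4 hold. Then
$\sup_{\boldsymbol{\beta}_1 \in \mathcal{B}} \|\mathcal{I}_{\boldsymbol{\beta}_1}\|_2 = O_P(1)$,
$\|\mathcal{I}_{\boldsymbol{\beta}_1^*}^{-1}\|_2 = O_P(1)$ and
$\sup_{\boldsymbol{\beta}_1 \in \mathcal{B}} \|\mathcal{I}_{\boldsymbol{\beta}_1} - \Sigma_{\boldsymbol{\beta}_1}\|_2 = o_P(1)$.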

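Proof. We prove the statement in the following three steps:

(i) For any $s \times 1$ vector function $\mathbf{a}(t)$ on $[0, \tau]$,
we have

$\|\int_0^\tau \mathbf{a}(t)\lambda_0(t)\,dt\|_2^2 \le \Lambda_0(\tau)\int_0^\tau \|\mathbf{a}(t)\|_2^2\lambda_0(t)\,dt$.

In fact, by definition,
$\|\int_0^\tau \mathbf{a}(t)\lambda_0(t)\,dt\|_2^2 = \sum_{i=1}^s (\int_0^\tau a_i(t)\lambda_0(t)\,dt)^2$,
where $a_i(t)$ is the $i$th component function of $\mathbf{a}(t)$. Using
the Hölder inequality, we obtain that

$\|\int_0^\tau \mathbf{a}(t)\lambda_0(t)\,dt\|_2^2 \le \sum_{i=1}^s \Lambda_0(\tau)\int_0^\tau a_i^2(t)\lambda_0(t)\,dt = \Lambda_0(\tau)\int_0^\tau \|\mathbf{a}(t)\|_2^2\lambda_0(t)\,dt$.

(ii) For any matrix function $A(t)$ on $[0, \tau]$, we have

$\|\int_0^\tau A(t)\lambda_0(t)\,dt\|_2^2 \le \Lambda_0(\tau)\int_0^\tau \|A(t)\|_2^2\lambda_0(t)\,dt$.

In fact,

$\|\int_0^\tau A(t)\lambda_0(t)\,dt\|_2 = \sup_{\|\mathbf{u}\|_2 = 1} \|\int_0^\tau A(t)\lambda_0(t)\,dt\,\mathbf{u}\|_2 = \sup_{\|\mathbf{u}\|_2 = 1} \|\int_0^\tau \mathbf{a}_u(t)\lambda_0(t)\,dt\|_2$,

where $\mathbf{a}_u(t) = A(t)\mathbf{u}$. Then

$\int_0^\tau \|A(t)\|_2^2\lambda_0(t)\,dt = \int_0^\tau \sup_{\|\mathbf{u}\| = 1} \mathbf{u}^TA(t)^TA(t)\mathbf{u}\,\lambda_0(t)\,dt$
$= \int_0^\tau \sup_{\|\mathbf{u}\| = 1} \|\mathbf{a}_u(t)\|_2^2\lambda_0(t)\,dt$
$\ge \sup_{\|\mathbf{u}\| = 1} \int_0^\tau \|\mathbf{a}_u(t)\|_2^2\lambda_0(t)\,dt$.

Therefore, by (i), the result holds.

(iii) By definition, we have

$\mathcal{I}_{\boldsymbol{\beta}_1} - \Sigma_{\boldsymbol{\beta}_1} = \int_0^\tau \{V(\boldsymbol{\beta}_1, t) - v(\boldsymbol{\beta}_1, t)\}s^{(0)}(\boldsymbol{\beta}_1^*, t)\lambda_0(t)\,dt + \int_0^\tau V(\boldsymbol{\beta}_1, t)\{S_n^{(0)}(\boldsymbol{\beta}_1^*, t) - s^{(0)}(\boldsymbol{\beta}_1^*, t)\}\lambda_0(t)\,dt$
$\equiv \Delta_{n1}(\boldsymbol{\beta}_1) + \Delta_{n2}(\boldsymbol{\beta}_1)$.

Using (ii), we obtain that

$\|\Delta_{n1}(\boldsymbol{\beta}_1)\|_2^2 \le \Lambda_0(\tau)\int_0^\tau \|V(\boldsymbol{\beta}_1, t) - v(\boldsymbol{\beta}_1, t)\|_2^2 (s^{(0)}(\boldsymbol{\beta}_1^*, t))^2\lambda_0(t)\,dt$.

Then, by Condition 2,
$\sup_{\boldsymbol{\beta}_1 \in \mathcal{B}} \|\Delta_{n1}(\boldsymbol{\beta}_1)\|_2 = o_P(1)$.
Similarly,

$\sup_{\boldsymbol{\beta}_1 \in \mathcal{B}} \|\Delta_{n2}(\boldsymbol{\beta}_1)\|_2 = o_P(1)$.

Therefore,

(12) $\sup_{\boldsymbol{\beta}_1 \in \mathcal{B}} \|\mathcal{I}_{\boldsymbol{\beta}_1} - \Sigma_{\boldsymbol{\beta}_1}\|_2 \le \sup_{\boldsymbol{\beta}_1 \in \mathcal{B}} \|\Delta_{n1}(\boldsymbol{\beta}_1)\|_2 + \sup_{\boldsymbol{\beta}_1 \in \mathcal{B}} \|\Delta_{n2}(\boldsymbol{\beta}_1)\|_2 = o_P(1)$.

By Condition 2(ii), we have

$\sup_{\boldsymbol{\beta}_1 \in \mathcal{B}} \|\Sigma_{\boldsymbol{\beta}_1}\|_2 \le \int_0^\tau \sup_{\boldsymbol{\beta}_1 \in \mathcal{B}, u \in [0, \tau]} \|v(\boldsymbol{\beta}_1, u)\|_2\, s^{(0)}(\boldsymbol{\beta}_1^*, u)\,d\Lambda_0(u) = O_P(1)$.

This, combined with (12), leads to

$\sup_{\boldsymbol{\beta}_1 \in \mathcal{B}} \|\mathcal{I}_{\boldsymbol{\beta}_1}\|_2 \le \sup_{\boldsymbol{\beta}_1 \in \mathcal{B}} \|\Sigma_{\boldsymbol{\beta}_1}\|_2 + \sup_{\boldsymbol{\beta}_1 \in \mathcal{B}} \|\mathcal{I}_{\boldsymbol{\beta}_1} - \Sigma_{\boldsymbol{\beta}_1}\|_2 = O_P(1)$.

Decompose $\mathcal{I}_{\boldsymbol{\beta}_1^*}$ as

$\mathcal{I}_{\boldsymbol{\beta}_1^*} = \Sigma_{\boldsymbol{\beta}_1^*}^{1/2}\{I + \Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}(\mathcal{I}_{\boldsymbol{\beta}_1^*} - \Sigma_{\boldsymbol{\beta}_1^*})\Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}\}\Sigma_{\boldsymbol{\beta}_1^*}^{1/2}$

and let
$\mathcal{A} = I + \Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}(\mathcal{I}_{\boldsymbol{\beta}_1^*} - \Sigma_{\boldsymbol{\beta}_1^*})\Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}$.
Then
$\mathcal{I}_{\boldsymbol{\beta}_1^*}^{-1} = \Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}\mathcal{A}^{-1}\Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}$.
Using the Bauer-Fike inequality [Bhatia (1997)], we obtain that

$|\lambda(\mathcal{A}) - 1| \le \|\Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}(\mathcal{I}_{\boldsymbol{\beta}_1^*} - \Sigma_{\boldsymbol{\beta}_1^*})\Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}\|_2 \le \|\Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}\|_2 \|\mathcal{I}_{\boldsymbol{\beta}_1^*} - \Sigma_{\boldsymbol{\beta}_1^*}\|_2 \|\Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}\|_2$.

Then by (12) and Condition 4, $|\lambda(\mathcal{A}) - 1| = o_P(1)$.
Hence, $\lambda(\mathcal{A}^{-1}) = 1 + o_P(1)$. Since $\mathcal{A}$ is
symmetric, $\|\mathcal{A}^{-1}\|_2 = O_P(1)$. This together with
Condition 4 yields
$\|\mathcal{I}_{\boldsymbol{\beta}_1^*}^{-1}\|_2 \le \|\Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}\|_2 \|\mathcal{A}^{-1}\|_2 \|\Sigma_{\boldsymbol{\beta}_1^*}^{-1/2}\|_2 = O_P(1)$. □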

The following tail condition is needed as a technicality in establishing
estimation loss results on the oracle estimator
$\hat{\boldsymbol{\beta}}^o$.

Condition 5.
$E\{\sup_{0 \le t \le \tau} Y(t)\|\mathbf{S}(t)\|_2^2\exp(\boldsymbol{\beta}_1^{*T}\mathbf{S}(t))\} = O(s)$.

For a fixed effective dimensionality $s$, Condition 5 is implied by the
following condition from Andersen and Gill (1982):

(13) $E\{\sup_{0 \le t \le \tau} Y(t)\|\mathbf{S}(t)\|_2^2\exp(\boldsymbol{\beta}_1^{*T}\mathbf{S}(t))\} < \infty$.

However, since we deal with diverging $s$, the above condition (13) is
obviously too tight to be satisfied. For example, when all variables in
$\mathcal{M}_*$ are bounded, we have $\|\mathbf{S}(t)\|_2^2 = O(s)$. In
general, if each $S_k(t)$ in $\mathbf{S}(t)$ satisfies (13), then
Condition 5 holds. Now we are ready to state the result on the existence
of the biased oracle estimator.

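Theorem 4.2 (Estimation loss). Under Conditions 1, 2 and 4, 5, with
probability tending to one, there exists an oracle estimator
$\hat{\boldsymbol{\beta}}^o$ such that

$\|\hat{\boldsymbol{\beta}}^o - \boldsymbol{\beta}^*\|_2 = O_P\{\sqrt{s}(n^{-1/2} + \lambda_n\rho'(\beta_n^*))\}$,

where $\beta_n^* = \min\{|\beta_j^*|, j \in \mathcal{M}_*\}$ is the
minimum signal strength.

Proof. Since
$\hat{\boldsymbol{\beta}}_2^o = \boldsymbol{\beta}_2^* = \mathbf{0}$, we
only need to consider the subvector in the first $s$ components, that
is, we can restrict our attention to the $s$-dimensional subspace
$\{\boldsymbol{\beta} \in \mathbb{R}^p : \boldsymbol{\beta}_{\mathcal{M}_*^c} = \mathbf{0}\}$.
It suffices to show that, for any $\varepsilon > 0$, there exists a
large constant $B$ and
$\gamma_n = B\{\sqrt{s}(n^{-1/2} + \lambda_n\rho'(\beta_n^*))\}$ such
that

$P\{\sup_{\|\mathbf{u}\|_2 = 1} \mathcal{C}(\boldsymbol{\beta}_1^* + \gamma_n\mathbf{u}, \mathbf{0}) < \mathcal{C}(\boldsymbol{\beta}_1^*, \mathbf{0})\} \ge 1 - \varepsilon$,

when $n$ is big enough, where for short $\mathcal{C}(\boldsymbol{\beta})$
denotes $\mathcal{C}(\boldsymbol{\beta}, \tau)$, and in particular
$\mathcal{C}(\boldsymbol{\beta}_1, \mathbf{0})$ represents
$\mathcal{C}((\boldsymbol{\beta}_1, \mathbf{0}), \tau)$. This indicates
that, with probability tending to one, there exists a local maximizer
$\hat{\boldsymbol{\beta}}_1^o$ such that
$\|\hat{\boldsymbol{\beta}}_1^o - \boldsymbol{\beta}_1^*\|_2 = O_P\{\sqrt{s}(n^{-1/2} + \lambda_n\rho'(\beta_n^*))\}$.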

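Let
$E_{n1}(\boldsymbol{\beta}_1, t) = E_{n1}((\boldsymbol{\beta}_1, \mathbf{0}), t)$,
$\mathcal{P}_n(\boldsymbol{\beta}_1) = n\lambda_n\sum_{j=1}^s \rho(|\beta_j|; \lambda_n)$
and let

$U_n(\boldsymbol{\beta}_1) = \sum_{i=1}^n \int_0^\tau \{\mathbf{S}_i(t) - E_{n1}(\boldsymbol{\beta}_1, t)\}\,dN_i(t)$

be the score of the unpenalized log partial likelihood with respect to
$\boldsymbol{\beta}_1$. By the Taylor expansion at $\gamma_n = 0$,

(14) $\mathcal{C}(\boldsymbol{\beta}_1^* + \gamma_n\mathbf{u}, \mathbf{0}) - \mathcal{C}(\boldsymbol{\beta}_1^*, \mathbf{0}) = \mathbf{u}^TU_n(\boldsymbol{\beta}_1^*)\gamma_n + 0.5\gamma_n^2\mathbf{u}^T\partial U_n(\boldsymbol{\beta}_1^*)\mathbf{u} + r_n(\tilde{\boldsymbol{\beta}}_1) - \mathcal{P}_n(\boldsymbol{\beta}_1^* + \gamma_n\mathbf{u}, \mathbf{0}) + \mathcal{P}_n(\boldsymbol{\beta}_1^*)$,

where the remainder term $r_n(\tilde{\boldsymbol{\beta}}_1)$ is equal to

$\frac{1}{6}\sum_{j,k,\ell} (\beta_j - \beta_j^*)(\beta_k - \beta_k^*)(\beta_\ell - \beta_\ell^*)\frac{\partial^2 U_{n\ell}(\tilde{\boldsymbol{\beta}}_1)}{\partial\beta_j\,\partial\beta_k}$

with $U_{n\ell}$ being the $\ell$th component of $U_n$ and
$\tilde{\boldsymbol{\beta}}_1$ lying between
$\boldsymbol{\beta}_1^* + \gamma_n\mathbf{u}$ and
$\boldsymbol{\beta}_1^*$. By Lemma 2.2 in the supplementary material
[Bradic, Fan and Jiang (2011)] we have
$\|U_n(\boldsymbol{\beta}_1^*)\|_2 = O_P(\sqrt{ns})$. It follows that

(15) $|\mathbf{u}^TU_n(\boldsymbol{\beta}_1^*)\gamma_n| = O_P(\sqrt{ns}\,\gamma_n)$.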

By simple decomposition, we have dU n {f3\) = —n{Xp x + where Tp 1 was 

defined in Lemma 4.2 and = rT 1 \ r (/3 1 ,t) dM(t). Hence, 

1 2 n u T dU n (P* 1 )u = -n 1 2 n {u T (-n- 1 dU n (f3* 1 ))u} 

= -n 7 2{u T S^u + u T [(If}. - E/,.) + W/j r ]u}. 

By Lemma 2.3 in the supplementary material [Bradic, Fan and Jiang (2011)] 
and Lemma 4.1, 

|| (i K _ + Wja . || 2 < \\x K - E K \\ 2 + \\W^ || 2 = o P (l). 
Therefore, by Condition 4, there exists a constant $c > 0$ such that

$$0.5\gamma_n^2 u^T \partial U_n(\beta_1^*) u \le -cn\gamma_n^2(1 + o_p(1)). \tag{16}$$

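Since $\|\tilde\beta_1 - \beta_1^*\|_2 \le \gamma_n$ and the average of i.i.d. terms,
$n^{-1}\partial^2 U_{nl}(\tilde\beta_1)/\partial\beta_j\,\partial\beta_k$, is of
order $O_p(1)$, we have $r_n(\tilde\beta_1) = o_p(n\gamma_n^2)$. By concavity of $\rho$ and the
decreasing property of $\rho'$ from Condition 1,

$$|P_n(\beta_1^* + \gamma_n u, 0) - P_n(\beta_1^*)|
  = n\lambda_n\Big|\sum_{j=1}^{s}\rho(|\beta_j^* + \gamma_n u_j|;\lambda_n) - \rho(|\beta_j^*|;\lambda_n)\Big|
  \le n\lambda_n\gamma_n\|\rho_0'(\beta_n^*)\|_2\|u\|_2(1 + o_p(1)),$$

where $\beta_n^*$ is the minimal signal length and $\rho_0'(\cdot)$ is the subvector of $\rho'(\cdot)$,
consisting of its first $s$ elements. Then

$$|P_n(\beta_1^* + \gamma_n u, 0) - P_n(\beta_1^*)| = O_p(n\lambda_n\sqrt{s}\,\gamma_n\rho'(\beta_n^*)). \tag{17}$$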
Combining (14)-(17) leads to

$$\mathcal{C}(\beta_1^* + \gamma_n u, 0) - \mathcal{C}(\beta_1^*, 0)
  \le n\gamma_n\{O_p(\sqrt{s/n}) + \sqrt{s}\lambda_n\rho'(\beta_n^*) - c\gamma_n(1 + o_p(1))\},$$

where with probability tending to one, the RHS is smaller than zero when
$\gamma_n = B(\sqrt{s/n} + \sqrt{s}\lambda_n\rho'(\beta_n^*))$ for a sufficiently large $B$. □

A simple corollary of this theorem is that the $L_1$ and $L_\infty$ estimation losses of
the oracle estimator are bounded by $s(n^{-1/2} + \lambda_n\rho'(\beta_n^*))$ and by
$\sqrt{s}(n^{-1/2} + \lambda_n\rho'(\beta_n^*))$, respectively. Hence, the $L_1$ loss can be close to zero
only if the sparsity parameter $\alpha < 1/2$, whereas the $L_\infty$ loss will converge to
zero with no restrictions on $\alpha$.
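
To see where these thresholds come from, here is a short worked calculation. It assumes $s = n^{\alpha}$ with $0 \le \alpha < 1$ (the parametrization used in the discussion of Theorem 4.4) and, for illustration only, ignores the bias term $\lambda_n\rho'(\beta_n^*)$:

```latex
\begin{align*}
L_1:      \quad & s\,n^{-1/2} = n^{\alpha - 1/2} \to 0 \iff \alpha < 1/2, \\
L_\infty: \quad & \sqrt{s}\,n^{-1/2} = n^{(\alpha - 1)/2} \to 0 \iff \alpha < 1 .
\end{align*}
```

The second condition is automatic whenever $s$ grows more slowly than $n$, which is why no further restriction on $\alpha$ is needed for the $L_\infty$ loss.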

To make the bias in the penalized estimation negligible, $\rho'(\beta_n^*)$ needs to
converge to zero at a specific rate controlled by the next condition.

Condition 6. The regularization parameter $\lambda_n$ satisfies that
$\sqrt{s}\lambda_n\rho'(\beta_n^*;\lambda_n) \to 0$ and
$\lambda_n \gg n^{-0.5 + (0.5\alpha + \alpha_1 - 1)_+ + \alpha_2}$, where $\alpha_1$ is defined in Condition 8,
and $\alpha_2$ is a positive constant.

Condition 6 regulates the behavior of the regularization parameter $\lambda_n$
around 0 and $\infty$. From the result of Theorem 4.2, we see that for different
penalties, the "extra term" $\sqrt{s}\lambda_n\rho'(\beta_n^*;\lambda_n)$ in the $L_2$ estimation loss will
require either extra conditions on $\lambda_n$ or extra conditions on the minimum
signal strength $\beta_n^*$ (see Theorems 4.3-4.5 for further details) and can govern
the estimation efficiency of the penalized estimators.

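Condition 7. Let $\kappa_0 = \max_{\delta\in\mathcal{N}_0}\kappa(\rho,\delta)$, where
$\mathcal{N}_0 = \{\delta \in R^s : \|\delta - \beta_1^*\|_\infty \le \beta_n^*\}$. Assume that $\lambda_n$ and $\beta_n^*$ satisfy
(i) $\beta_n^* \gg \sqrt{s}(n^{-1/2} + \lambda_n\rho'(\beta_n^*))$ and
(ii) $\lambda_{\min}(\Sigma_{\beta_1^*}) > \lambda_n\kappa_0$.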
Condition 7(i) is employed to make $\hat\beta^o$ fall in $\mathcal{N}_0$ with probability tending
to one. For LASSO, since $\rho'(\beta_n^*) = 1$, it means that $\beta_n^* \gg \sqrt{s}\lambda_n$. By Condition 6,
it reduces to $\beta_n^* \gg \sqrt{s}\,n^{-0.5 + (0.5\alpha + \alpha_1 - 1)_+ + \alpha_2}$. For SCAD, if $\beta_n^* \gg \lambda_n$,
then $\rho'(\beta_n^*) = 0$ when $n$ is large enough and hence it requires that
$\beta_n^* \gg \sqrt{s}\,n^{-0.5}$. Therefore, Condition 7(i) is less restrictive for SCAD-like penalties.
Condition 7(ii) is used to ensure the condition in (7) holds with probability
tending to one (see the proof of Theorem 4.3). It always holds when $\kappa_0 = 0$
(e.g., for the LASSO penalty) and is satisfied for the SCAD type of penalty
when $\beta_n^* \gg \lambda_n$.
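
The contrast between the two penalties can be checked numerically. The sketch below is illustrative only: it assumes the standard SCAD derivative of Fan and Li (2001) with the conventional shape parameter a = 3.7, and writes rho'(t; lam) = p'_lam(t) / lam. The function names are hypothetical and are not taken from the paper.

```python
def lasso_rho_prime(t, lam):
    """rho'(t; lam) for the L1 penalty p_lam(t) = lam * t: constant 1 for t >= 0."""
    return 1.0


def scad_rho_prime(t, lam, a=3.7):
    """rho'(t; lam) = p'_lam(t) / lam for the SCAD penalty, t >= 0.

    Assumption: the usual SCAD form of Fan and Li (2001) with shape a > 2.
    """
    if t <= lam:
        return 1.0
    # Linear decay on (lam, a*lam], exactly zero beyond a*lam.
    return max(a * lam - t, 0.0) / ((a - 1.0) * lam)


if __name__ == "__main__":
    lam = 0.1
    for signal in (0.05, 0.2, 1.0):
        print(signal, lasso_rho_prime(signal, lam), scad_rho_prime(signal, lam))
```

Once the minimum signal exceeds a * lam, the SCAD derivative is exactly zero, so the bias term in Condition 7(i) drops out, whereas for LASSO it stays at one regardless of the signal size.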

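Condition 8. For $\alpha_1 > 0$ and $0 < C < \infty$,

$$\sup_{0\le t\le\tau}\ \sup_{v_1\in B(\beta_1^*,\beta_n^*)}\|V(t,v)\|_{2,\infty}
  = \min\Big(C\frac{\rho'(0+)}{\rho'(\beta_n^*)},\ O_p(n^{\alpha_1})\Big),$$

where $B(\beta_1^*,\beta_n^*)$ is an $s$-dimensional ball centered at $\beta_1^*$ with radius $\beta_n^*$, for
$v = (v_1^T, 0^T)^T$,

$$V(t,v) = \frac{S_n^{(0)}(v,t)\,S_{n,21}^{(2)}(v,t) - S_{n,2}^{(1)}(v,t)\,(S_{n,1}^{(1)}(v,t))^T}{\{S_n^{(0)}(v,t)\}^2} \in R^{(p-s)\times s},$$

and $\|V(t,v)\|_{2,\infty} = \max_{\|x\|_2 = 1}\|V(t,v)x\|_\infty$.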

As noted in Fleming and Harrington [(1991), page 149], $V(\beta,t)$ is an
empirical covariance matrix of $X_i(t)$ computed with weights proportional to
$Y_i(t)\exp\{\beta^T X_i(t)\}$. Hence, $V$ is the empirical covariance matrix between
the important variables $S_i(t)$ and unimportant variables $Q_i(t)$. Condition 8
controls the uniform growth rate of the norm of these covariance matrices,
a notion of weak correlation between $S_i(t)$ and $Q_i(t)$. For the $L_1$ penalty,
$\rho'(\beta_n^*) = 1$, and Condition 8 becomes a version of the "strong irrepresentable"
condition [Zhao and Yu (2006)] for censored data. It is very stringent as
the right-hand side has to be bounded by $O(1)$. On the other hand, for
the SCAD penalty, if $\beta_n^* \gg \lambda_n$, then $\rho'(\beta_n^*) = 0$ when $n$ is large enough.
Therefore, Condition 8 is significantly relaxed to $O(n^{\alpha_1})$. In general, when
a folded concave penalty is employed, the upper bound on the right-hand
side in Condition 8 can grow to infinity at a polynomial rate. This was also
noted in the work of Fan and Lv (2011) in the context of generalized linear
models.

Theorem 4.3 (Strong oracle). Let the oracle estimator $\hat\beta^o$ be a local
maximizer of $\mathcal{C}(\beta_1,\tau)$ given by Theorem 4.2. If
$\max_j(\sigma_j^2) = O(n^{(0.5\alpha + \alpha_1 - 1)_+ + \alpha_2})$,
and Conditions 1-8 hold, then with probability tending to one, there exists
a local maximizer $\hat\beta$ of $\mathcal{C}(\beta,\tau)$ such that

$$P(\hat\beta = \hat\beta^o) \ge 1 - c_0(p - s)\exp\{-c_1 n^{(0.5\alpha + \alpha_1 - 1)_+ + \alpha_2}\},$$

where $c_0$ and $c_1$ are positive constants.

Proof. It suffices to show that $\hat\beta^o$ is a local maximizer of $\mathcal{C}(\beta,\tau)$ on
a set $\Omega_n$ which has a probability tending to one. By Theorem 2.1, we need
to show that, with probability tending to one, $\hat\beta^o$ satisfies (5)-(7). Since $\hat\beta^o$
already satisfies (5) by definition, we are left to check (6) and (7).

Define $\Omega_n = \{\xi : \|\xi_{\mathcal{M}_*^c}\|_\infty \le \sqrt{n}\,u_n\}$ for some diverging sequence $u_n$ to be
chosen later, where $\xi_{\mathcal{M}_*^c}$ is the subvector of $\xi$ with indices in $\mathcal{M}_*^c$. By
Theorem 3.1, there exist positive constants $c_0$ and $c_1$ such that

$$P(|\xi_j| > \sqrt{n}\,u_n) \le c_0\exp\{-c_1 u_n\}$$

uniformly over $j$. Then using the Bonferroni union bound, we obtain that

$$P(\Omega_n) \ge 1 - \sum_{j\in\mathcal{M}_*^c} P(|\xi_j| > \sqrt{n}\,u_n)
  \ge 1 - c_0(p - s)e^{-c_1 u_n} \to 1 \quad\text{as } n \to \infty, \tag{18}$$

where $u_n$ can be chosen later to make $(p - s)e^{-c_1 u_n} \to 0$. We now check if (6)
holds for $\hat\beta^o$ on the set $\Omega_n$. Denote by $\rho'_{\mathcal{M}_*^c}$ the subvector of $\rho'(|\hat\beta^o|)$ with
indexes in $\mathcal{M}_*^c$. Let $\gamma(\beta) = \int_0^\tau S_n^{(1)}(\beta,u)/S_n^{(0)}(\beta,u)\,dN(u)$ and

z (/3°) = E A Q< (* ) - E - } ^° ' * ) > diV « (*) > 

i=l Jo 

where El 2) (/3,i) = s2(/3,t)/,si 0) (/3,i). Then by Condition 1, we have 

H0°)\\oo < U M %\\oo + WlM%(0*) ~ lMi0°)\\oo 

(19) =o(^Tiu n + 



= 0[y/nu n + sup sup \\V{u, vi)|| 2 ,ooPi - Pih 
v o<u<T Vl eB(0 lt p*) 

where v (vf,0 T ) T with vi being between 0* and 0°, and V"(tt,Vi) is 
defined in Condition 8. By Theorem 4.2 and Condition 8, we obtain that 
(nA n x p / (0+))- 1 ||z( / 3°)|| oo is bounded by 

n- 1 \- 1 O p \^iu n + sup ||V - (u,vi)|| 3 ,ooVs(n" 1/2 + A n /t/(/3;))} 



= O p {n- 1 / 2 A^ 1 (u n + n - 5 ^ 01 " 1 ) + n- 1+a5 V (0+)} ->• 0, 
if we take u n = n (o.5a+ ai -i) + +a 2 and An > n -o.5+(o.5a+ai-i)++a 2- There- 
fore, (6) holds on O n . Once 5 < (0.5a + a\ — 1) + + a2> (18) holds. 

We are now left to show that (7) holds with probability tending to one, 
that is, Aminj?!" 1 Jq V(/3°, t) dN(t)} > \ n K{p,0°), which is guaranteed by 
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Condition 7. In fact, by Theorem 4.2 and Condition 7(i), with probability 
tending to one, (3° falls in A/"o as n — > oo, so that k(p,0°) < ko. Hence, by 
Condition 7(h), with probability tending to one, 

(20) A min (S^)>A n K(p,^). 
Recall that 

J 

By Theorem 4.2, as n — > oo, (3° 6 B with probability tending to one. This 
combining with Lemma 4.1 and Lemma 2.3 in the supplementary material 
[Bradic, Fan and Jiang (2011)] leads to 



n- 1 I V(p1,t)dN(t) = S Bo +E, 
Jo 1 



where ||i?||2 = o p (l). By Condition 2(i), (hi), with probability tending to 
one, 

| | Ego - ][ 2 = Op(l). 

Let E* = n' 1 fiV(ffi,t)dN(t) - Then ||£*|| 2 = o p (l). Using Weyl's 
pertubation theorem [Bhatia (1997)], we obtain that 



mm 

Kk<s 



< \\E* 



where Afc(X^») is the kth. largest eigenvalue of £»*• Therefore, 
Arninjn" 1 jT V(0?, *) d#(*)j = A min (5]^) + Op(l). 
This combining with (20) yields that with probability tending to one 

Aminj^ 1 jT V()9?, t) diV(t) | > X n K,(p, fa). 



□ 



The theorem becomes nontrivial if $\delta < (0.5\alpha + \alpha_1 - 1)_+ + \alpha_2$, since
$\log p = O(n^\delta)$. Apart from the work of Bradic, Fan and Wang (2011), no formal work
explicitly relates the oracle property and the full and effective dimensionalities.
Theorem 4.3 shows that $\hat\beta$ becomes the biased oracle with probability
tending to one exponentially fast. Then combining Theorems 4.2 and 4.3
leads to the following $L_2$ estimation loss:

$$\|\hat\beta_1 - \beta_1^*\|_2 = O_p\{\sqrt{s}(n^{-1/2} + \lambda_n\rho'(\beta_n^*))\}. \tag{21}$$

This theorem tells us that the resulting estimator behaves as if the true set
of "important variables" were known (i.e., as the oracle estimator) with
probability converging to 1 as both $p$ and $n$ go to $\infty$. Previous notions of
the oracle property required only that the estimator behave like the oracle,
rather than coincide with the oracle itself. The classical oracle property of
Fan and Li (2002) and the sign consistency of Bickel, Ritov and Tsybakov (2009)
are both corollaries of this result.
In this sense Theorem 4.3 introduces a tighter notion of an oracle property.
It was first mentioned in Kim, Choi and Oh (2008) for the SCAD estimator
of the linear model with polynomial dimensionality and then extended
by Bradic, Fan and Wang (2011) to the penalized M-estimators under the
ultra-high dimensionality setting. Extending their work to Cox's model was
exceptionally challenging because of martingale and censoring structures.

Theorem 4.4 (LASSO). Under Conditions 2-5, if $\max_j(\sigma_j^2) = O(n^{\alpha_2})$,
$\sqrt{s}\lambda_n \to 0$, $\lambda_n \gg n^{-0.5 + \alpha_2}$ and

$$\sup_{0\le t\le\tau}\ \sup_{v_1\in B(\beta_1^*,\beta_n^*)}\|V(t,v)\|_{2,\infty} = O_p(1),$$

then the result in Theorem 4.3 holds for the LASSO estimator with probability
being at least $1 - c_0(p - s)\exp\{-c_1 n^{\alpha_2}\}$. Furthermore,

$$\|\hat\beta_1 - \beta_1^*\|_2 = O_p(\sqrt{s}\lambda_n).$$

The proof of this theorem is relegated to the supplementary material
[Bradic, Fan and Jiang (2011)]. For the LASSO, the rate of convergence
for nonvanishing components is dominated by the bias term $\lambda_n \gg n^{-1/2}$.
In addition, since $s = n^\alpha$, the condition $\sqrt{s}\lambda_n \to 0$ indicates that
$\alpha < 1 - 2\alpha_2$, where $\alpha_2 \in [0, 1/2)$. That is, the bigger $\alpha_2$ is, the smaller the
sparsity dimension $s$ that can be recovered using LASSO. Moreover, LASSO with
$\alpha_2 < 1/2$ requires $p \ll \exp\{c_1 n^{\alpha_2}\}$ to achieve the strong oracle property. Hence, as $p$
(or $\alpha_2$) gets bigger, $s$ (or $\alpha$) should get smaller. This means that, as data
dimensionality gets higher, recoverable problems get sparser. This is a new
discovery and has not been documented in the literature. On the other hand,
for folded concave penalties, faster rates of convergence are obtained with
fewer restrictions on $p$ and $s$. This can be seen from the following result,
which is a straightforward corollary of Theorem 4.3 and whose proof is left
for the supplementary material [Bradic, Fan and Jiang (2011)].
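
The restriction $\alpha < 1 - 2\alpha_2$ for LASSO follows from a one-line calculation. As a worked check, using only the two conditions on $\lambda_n$ in Theorem 4.4 and $s = n^\alpha$:

```latex
\begin{align*}
n^{\alpha/2 - 1/2 + \alpha_2} = \sqrt{s}\, n^{-1/2 + \alpha_2} \ll \sqrt{s}\,\lambda_n \to 0
\ \Longrightarrow\ \alpha/2 - 1/2 + \alpha_2 < 0
\iff \alpha < 1 - 2\alpha_2 .
\end{align*}
```

For instance, $\alpha_2 = 1/4$ forces $\alpha < 1/2$, so only $s = o(\sqrt{n})$ nonzero coefficients remain recoverable in that regime.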

Theorem 4.5 (SCAD). Under Conditions 1-5, if $\beta_n^* \gg \lambda_n$,
$\max_j(\sigma_j^2) = O(n^{(0.5\alpha + \alpha_1 - 1)_+ + \alpha_2})$,
$\lambda_n \gg n^{-0.5 + (0.5\alpha + \alpha_1 - 1)_+ + \alpha_2}$ and

$$\sup_{0\le t\le\tau}\ \sup_{v_1\in B(\beta_1^*,\beta_n^*)}\|V(t,v)\|_{2,\infty} = O_p(n^{\alpha_1}),$$

then the result in Theorem 4.3 holds for the SCAD estimator with probability
being at least $1 - c_0(p - s)\exp\{-c_1 n^{(0.5\alpha + \alpha_1 - 1)_+ + \alpha_2}\}$. Furthermore,

$$\|\hat\beta_1 - \beta_1^*\|_2 = O_p(\sqrt{s/n}).$$

Note that the proof of Theorem 4.3 shows $\hat\beta_2 = 0$ on a set whose
probability measure is going to one exponentially fast. For statistical inference
about $\beta$, asymptotic properties of $\hat\beta_1$ need to be explored. To be able to
construct confidence intervals for $\beta_1$ we need to derive the asymptotic
distribution of $\hat\beta_1$. This was done in Fan and Li (2002) for fixed $p$ and in Cai et al. (2005)
for $p = o(n^{1/4})$. Here, we allow $p$ to diverge at exponential rate $O(\exp\{n^\delta\})$
and the effective dimensionality $s$ to diverge at rate $o(n^{1/3})$. To the best
of our knowledge there is no work available for such a setting. Extending
the previous work to such an NP-dimensional setting is not trivial and
requires complicated eigenvalue results. Moreover, the large deviation result
in Section 3, the strong oracle result in Theorem 4.3 and Lemmas 2.1-2.3 in
the supplementary material [Bradic, Fan and Jiang (2011)] are essential for
establishing the desired asymptotics. In addition, the following Lemma 4.2 is
an important extension of the classical asymptotic Taylor expansion results
when the number of parameters is diverging with the sample size.

Lemma 4.2. For any s x 1 unit vector h n , let 

cf> n = b^Vn" 1 dUniPX^n-^UniPl) 

and 

If Conditions 2, 4 an d 5 hold and ifs = o(rz 1 / 3 ) ? then 4> n = cf) n i + o p (l). 

— 1/2 — 1/2 

Proof. Let B = I+Tp 1 Wp*!^' , where / is an s x s identity matrix. 
Using the Bauer-Fike inequality [Bhatia (1997)], we obtain that 

\X(B)-1\<\\1^ /2 W^%. 

Then by the Holder inequality we have |A(£>) — 1| < ||2^* 1//2 ||2l|W ( g* ||2- Ap- 
plying Condition 4 and Lemma 4.1 and Lemma 2.3 of the supplementary 
material [Bradic, Fan and Jiang (2011)], we establish that 

(22) \{B) = 1 + O p {s/^) 

uniformly for all eigenvalues of B. Note that 

It follows that 

K = h^l-y^U n ((3l) - hl^gl-^il - B^}l^ 2 n^U n ((3l) 

= (f) n i - (j} n2 . 






Since/ — B 1 is symmetrical, r a {I — B 1 ) = \\I — B 1 ||2- Recall that ||b n ||2 = 
1; it follows that 

By Condition 4, H2 = O p (l). From Lemma 4.1, we have ||Xg» H2 = 

O p (l). By Lemma 2.2 in the supplementary material [Bradic, Fan and Jiang 
(2011)], \\n- l l 2 U n {f3\)\\ 2 = O p {^s). Therefore, 

(23) \cP n2 \=r a (I-B~ 1 )0 P (V^). 

By definition, it is easy to see that

$$r_\sigma(I - B^{-1}) = \max\{|1 - \lambda| : \lambda \in \sigma(B^{-1})\} = \max\{|1 - \lambda^{-1}| : \lambda \in \sigma(B)\},$$

which, combined with (22), leads to $r_\sigma(I - B^{-1}) = O_p(s/\sqrt{n})$. This together
with (23) yields that $\phi_{n2} = O_p(\sqrt{s^3/n}) = o_p(1)$, if $s = o(n^{1/3})$. Hence,
$\phi_n = \phi_{n1} + o_p(1)$. □

With Lemma 4.2 and the technical lemmas presented in the supplementary
material [Bradic, Fan and Jiang (2011)] we are ready to state the results
on the asymptotic behavior of the penalized estimator. A detailed proof is
included in the supplementary material [Bradic, Fan and Jiang (2011)].

Theorem 4.6. Under Conditions 1-8, and for $\lambda_n\rho'(\beta_n^*) = o((sn)^{-1/2})$,
for any $s \times 1$ unit vector $b_n$, if $s = o(n^{1/3})$, the penalized partial likelihood
estimator $\hat\beta_1$ from (21) satisfies

$$\sqrt{n}\,b_n^T\Sigma_{\beta_1^*}^{1/2}(\hat\beta_1 - \beta_1^*) \to_D N(0, 1).$$

Theorems 4.3 and 4.6 claim that $\hat\beta$ enjoys model selection consistency and
achieves the information bound mimicking that of the oracle estimator $\hat\beta^o$.

5. Iterative coordinate ascent algorithm (ICA). Coordinate-wise algorithms
are especially attractive for $p \gg n$ and have been previously introduced
for penalized least-squares with the $L_1$-penalty by Daubechies, Defrise
and De Mol (2004), Friedman et al. (2007), Wu and Lange (2008) and
for generalized linear models with the folded concave penalty by Friedman,
Hastie and Tibshirani (2010) and Fan and Lv (2011). By Condition 1 and
Proposition 2.7.1 in Bertsekas (2003), the coordinate-wise maximization
algorithm in each iteration provides limits that are stationary points of the
overall optimization (3). Therefore, each output of the ICA algorithm is
a stationary point. We adapt the algorithm in Fan and Lv (2011) to
censored data.
First, let us, with slight abuse of notation, denote by $Q_n(\beta) = L_n(\beta) - P_n(\beta)$,
where $L_n(\cdot)$ and $P_n(\cdot)$ stand for the loss and penalty parts, respectively.
Let $l_n(\beta,\zeta,j)$ be the partial quadratic approximation of $L_n(\beta)$ at
$\zeta \in R^p$ along the $j$th coordinate, where $\{\beta_k = \zeta_k, k \ne j\}$ are held fixed, but $\beta_j$
is allowed to vary, and let

$$q_n(\beta_j,\zeta,j) = l_n(\beta,\zeta,j) - n\,p_{\lambda_n}(|\beta_j|).$$

Because of the complex likelihood function we need an additional loop to
compute the partial quadratic approximation.

This penalized quadratic optimization problem can be solved analytically,
avoiding the challenges of nonconcave optimization. The algorithm updates a
coordinate only if the maximizer of the penalized univariate optimization strictly
increases the objective function $Q_n(\beta)$ and if the coordinate belongs to the
set $\{j : |z_j| > \rho'(0+)\}$.
The algorithm stops when two successive values of the objective function $Q_n(\beta)$
differ by no more than $10^{-8}$, say. Details of the algorithm are presented
in the supplementary material [Bradic, Fan and Jiang (2011)].
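
As an illustration of why each univariate step has a closed form, the sketch below gives the well-known thresholding solutions of the one-dimensional problem min over theta of 0.5 * (z - theta)^2 + p_lam(|theta|) for the LASSO and SCAD penalties. This is a simplified, unit-curvature stand-in and not the ICA update itself (which is specified in the supplementary material); the function names and the default a = 3.7 are assumptions made for the example.

```python
def soft_threshold(z, lam):
    """LASSO solution: shrink z toward zero by lam, and set small values to zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def scad_threshold(z, lam, a=3.7):
    """SCAD solution of min_theta 0.5*(z - theta)**2 + p_lam(|theta|), with a > 2.

    Assumption: unit curvature in the quadratic term (Fan and Li, 2001).
    """
    abs_z = abs(z)
    if abs_z <= 2.0 * lam:
        # Same as soft thresholding near the origin.
        return soft_threshold(z, lam)
    if abs_z <= a * lam:
        # Intermediate zone: shrinkage tapers off linearly.
        sign = 1.0 if z > 0 else -1.0
        return ((a - 1.0) * z - sign * a * lam) / (a - 2.0)
    # Large signals are left untouched, so no bias is introduced.
    return z


if __name__ == "__main__":
    for z in (0.5, 1.5, 3.0, 10.0):
        print(z, soft_threshold(z, 1.0), scad_threshold(z, 1.0))
```

Large inputs pass through the SCAD rule unchanged, while the LASSO rule always subtracts lam; this mirrors the role of rho'(beta_n^*) in the bias term of Theorem 4.2.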

5.1. Simulated examples. To show good model selection and estimation
properties of the proposed methodology, we simulated 100 standard Toeplitz
ensembles of size 100 with population correlation $\rho(X_i, X_j) = \rho^{|i-j|}$ with $\rho$
taking the values 0.25, 0.5, 0.75 and 0.9. The distribution of the censoring time $C$ is
exponential with mean $U\exp\{X^T\beta\}$, where $U$ is randomly generated from the
uniform distribution over [1,3] for each simulated data set. This censoring
was used in Fan and Li (2002), which makes about 30% of the data censored.
The full and effective dimensionalities of the true parameter $\beta$ are taken
as {100,4}, {1,000,4}, {5,000,4} and {1,000,25}, respectively, with values
$\pm 1$ randomly placed (the rest is set as zero). The penalties employed are
LASSO [Tibshirani (1996)], SCAD [Fan and Li (2001)], SICa [Lv and Fan
(2009)] with $p_\lambda(|\beta_j|) = (\lambda + 1)|\beta_j|/(\lambda + |\beta_j|)$ and MCP+ [Zhang (2010)] with
all regularization parameters being computed with 5-fold sparse generalized
cross validation; see Section 5.2 and Table 2 therein for detailed discussion
on the choice of cross validation statistics.
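
A minimal sketch of one such simulated data set is given below. Several ingredients are assumptions made for illustration, since the text does not spell them out: the covariates are taken to be Gaussian with the Toeplitz covariance, and the survival times are taken to be exponential with hazard exp(x^T beta). The censoring mechanism follows the description above. The function name `simulate_cox_data` is hypothetical.

```python
import numpy as np


def simulate_cox_data(n=100, p=100, s=4, rho=0.5, seed=0):
    """Draw one censored data set with a Toeplitz design (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Toeplitz population correlation: corr(X_i, X_j) = rho ** |i - j|.
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)  # ASSUMPTION: Gaussian

    # True parameter: s entries equal to +1 or -1 at random positions, rest zero.
    beta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    beta[support] = rng.choice([-1.0, 1.0], size=s)

    lin = X @ beta
    # ASSUMPTION: survival times are exponential with hazard exp(x^T beta).
    T = rng.exponential(scale=np.exp(-lin))
    # Censoring: exponential with mean U * exp(x^T beta), U ~ Uniform[1, 3].
    U = rng.uniform(1.0, 3.0)
    C = rng.exponential(scale=U * np.exp(lin))

    time = np.minimum(T, C)
    event = (T <= C).astype(int)  # 1 = observed failure, 0 = censored
    return X, time, event, beta


if __name__ == "__main__":
    X, time, event, beta = simulate_cox_data()
    print(X.shape, event.mean(), np.flatnonzero(beta))
```

The censoring fraction produced by this sketch depends on the assumed survival model and need not match the roughly 30% reported in the text.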

The results of the simulations are summarized in three tables (see Table 1
in the main text and Tables 2 and 3 in the supplementary material
[Bradic, Fan and Jiang (2011)]) where we report the median prediction
error (PE)

$$P_n[\exp\{-\beta^{*T}X\} - \exp\{-\hat\beta^T X\}]^2, \tag{24}$$

where $P_n$ stands for the empirical probability measure. We also report the
median number of nonzero parameters estimated in the set $\mathcal{M}_*$ as the number
of true positives TP. Furthermore, we summarize the median number of
nonzero estimates in the set $\mathcal{M}_*^c$ as the number of false positives FP.
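
For concreteness, the three summaries can be computed as in the following sketch, which evaluates the empirical average in (24) over the rows of a design matrix and counts nonzero estimates inside and outside the true support. The function names are hypothetical and chosen only for this example.

```python
import numpy as np


def prediction_error(X, beta_true, beta_hat):
    """Empirical version of (24): mean of [exp(-x^T beta*) - exp(-x^T beta_hat)]^2."""
    diff = np.exp(-X @ beta_true) - np.exp(-X @ beta_hat)
    return float(np.mean(diff ** 2))


def true_false_positives(beta_true, beta_hat):
    """Count nonzero estimates on the true support (TP) and off it (FP)."""
    selected = beta_hat != 0
    tp = int(np.sum(selected & (beta_true != 0)))
    fp = int(np.sum(selected & (beta_true == 0)))
    return tp, fp


if __name__ == "__main__":
    X = np.eye(3)
    beta_true = np.array([1.0, 0.0, -1.0])
    beta_hat = np.array([0.9, 0.2, 0.0])
    print(prediction_error(X, beta_true, beta_hat))
    print(true_false_positives(beta_true, beta_hat))
```

In a simulation study these quantities would be computed per replication and then summarized by their medians, as reported in Table 1.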
Table 1

Simulation results for p > n under correlation settings ranging from 0.25 to 0.90 with
median prediction error (MPE), # of true positives (TP), # of false positives (FP) and
standard deviation in parentheses for each estimate

Method   MPE              TP           FP             MPE              TP           FP

Settings of n = 100, p = 100, s = 4
         Case $\rho$ = 0.25                              Case $\rho$ = 0.5
Oracle   0.0154 (1.27*)   4                           0.0215 (1.97*)   4
LASSO    0.0178 (1.26)    4 (1.61)     2 (33.34)      0.0284 (2.12)    4 (1.52)     13 (33.02)
SCAD     0.0161 (1.24)    4 (1.61)     2 (34.21)      0.0223 (2.03)    4 (1.52)     13 (35.56)
SICa     0.0190 (1.27)    3 (1.48)     2 (26.11)      0.0275 (2.43)    3 (1.44)     9 (21.54)
MCP+     0.0166 (1.22)    3 (1.71)     2 (32.49)      0.0271 (2.33)    4 (1.54)     24 (34.62)
         Case $\rho$ = 0.75                              Case $\rho$ = 0.9
Oracle   0.0322 (2.05*)   4                           0.0538 (4.43*)   4
LASSO    0.0371 (2.42)    3 (1.14)     12 (31.21)     0.0665 (4.62)    2 (1.48)     13 (32.16)
SCAD     0.0326 (2.12)    4 (1.14)     12 (31.53)     0.0549 (3.36)    3.5 (1.49)   8 (31.11)
SICa     0.0343 (2.27)    2 (1.30)     3 (18.41)      0.0566 (3.26)    2 (1.32)     6 (24.42)
MCP+     0.0326 (2.21)    3.5 (1.22)   12 (32.31)     0.0558 (3.44)    2.5 (1.29)   15 (29.68)

Settings of n = 100, p = 1,000, s = 4
         Case $\rho$ = 0.25                              Case $\rho$ = 0.5
Oracle   0.0154 (1.27*)   4                           0.0215 (1.97*)   4
LASSO    0.0201 (1.38)    4 (0.85)     23 (371.8)     0.0383 (3.16)    3.5 (1.23)   45 (532.1)
SCAD     0.0162 (1.25)    4 (0.83)     15 (323.4)     0.0281 (2.12)    4 (1.12)     36 (430.3)
SICa     0.0189 (1.17)    3.5 (0.54)   9 (120.5)      0.0492 (3.18)    3 (1.43)     15 (319.4)
MCP+     0.0192 (1.23)    4 (0.83)     17 (345.5)     0.0281 (2.15)    4 (1.12)     36 (409.2)
         Case $\rho$ = 0.75                              Case $\rho$ = 0.9
Oracle   0.0322 (2.05*)   4                           0.0538 (4.43*)   4
LASSO    0.0497 (3.16)    3 (0.44)     96 (306.5)     0.0703 (4.24)    3 (1.54)     97 (411.5)
SCAD     0.0358 (2.45)    4 (0.34)     85 (250.7)     0.0583 (4.13)    4 (1.51)     67 (380.9)
SICa     0.0372 (2.15)    2 (1.30)     90.5 (90.3)    0.0546 (3.98)    1 (1.78)     30 (354.1)
MCP+     0.0361 (2.77)    3.5 (1.14)   90 (320.4)     0.0592 (4.25)    3.5 (1.58)   98 (402.3)

Settings of n = 100, p = 5,000, s = 4
         Case $\rho$ = 0.25                              Case $\rho$ = 0.5
Oracle   0.0154 (1.27*)   4                           0.0215 (1.97*)   4
LASSO    0.0220 (1.49)    4 (1.05)     68 (398.1)     0.0462 (4.05)    3.5 (1.64)   33 (206.8)
SCAD     0.0170 (1.28)    4 (1.05)     67 (298.2)     0.0328 (3.15)    3.5 (1.56)   21.5 (205.4)
SICa     0.0195 (1.19)    2.5 (1.17)   14 (345.7)     0.0285 (3.35)    4 (1.41)     30 (323.3)
MCP+     0.0188 (1.29)    3 (1.10)     67 (298.2)     0.0358 (2.85)    3.5 (1.51)   73.5 (348.7)
         Case $\rho$ = 0.75                              Case $\rho$ = 0.9
Oracle   0.0322 (2.05*)   4                           0.0538 (4.43*)   4
LASSO    0.0567 (5.02)    3 (1.73)     23 (250.5)     0.0865 (4.52)    2 (1.23)     59 (208.8)
SCAD     0.0360 (2.31)    4 (1.51)     18 (234.7)     0.0596 (4.12)    4 (0.89)     49 (105.4)
SICa     0.0385 (2.13)    2.5 (1.30)   3 (225.2)      0.0602 (4.92)    3 (0.45)     46 (90.3)
MCP+     0.0392 (2.82)    4 (1.74)     4 (326.2)      0.0578 (4.33)    4 (0.89)     11 (217.1)

* stands for column of standard deviation x 100.
Table 1 summarizes three p > n examples, where especially the last two
stress the strengths of the methods when $p \gg n$ and the spectrum of the design
matrix is high; see Table 1 in the supplementary material [Bradic, Fan and
Jiang (2011)]. All four methods work quite well, where LASSO has higher
PE than the rest, with SCAD and MCP+ performing quite closely to each
other. SICa performs worse than the others, always losing a number of TPs.
The case of $\rho = 0.90$ leads to a bigger prediction error and a smaller
number of TP for all methods, where the jump is the largest for the LASSO penalty.
SCAD and MCP+ keep their performance similar to that of the oracle through all
examples, hence verifying the strengths of nonconvex penalties. For more
detailed discussions and results when the oracle estimator fails, when the
censoring rate is too high, and on the relative estimation efficiency of the
LASSO estimator with respect to SCAD, SICa and MCP+, we refer the reader to
the supplementary material [Bradic, Fan and Jiang (2011)] for this paper.

5.2. Real data example. To demonstrate the strength of the proposed
methodology, in this section, we present a gene association study with respect
to the survival time of non-Hodgkin's lymphoma. Genetic mechanisms
responsible for the clinical heterogeneity of follicular lymphoma are still
unknown. Dave et al. (2004) have collected gene expression data on 191 biopsy
specimens obtained from patients with untreated follicular lymphoma. RNA
was extracted from fresh-frozen tumor-biopsy specimens from 191 patients who
had received a diagnosis between 1974 and 2001; the specimens and survival
times were obtained from seven institutions, and the specimens were examined
for gene expression with the use of Affymetrix U133A and U133B microarrays.
The median age at diagnosis was 51 years (range, 23 to 81), and the median
follow-up time was 6.6 years (range, less than 1.0 to 28.2). The dataset was obtained
from http://llmpp.nih.gov/FL.

The full cohort study included 44,187 probe expression values, out of which
only 34,188 were properly annotated. Among these, some received multiple
(2-7) measurements per gene. We took the median value as a unique
representative and were left with 17,118 different genes. We separated
the dataset into training and testing sets with 80% and 20% of censored
samples, respectively. The censoring rate of 50% was kept in each of the
training and testing samples. Recorded for each individual are the follow-up
time, an indicator of the status at the follow-up time and measurements of
the expression value for each Affymetrix probe set.
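
The collapsing of multiple probes into one value per gene can be sketched as follows. The data layout (a dictionary from probe identifier to a list of per-sample values, plus a probe-to-gene annotation map) and the function name are assumptions made for this example; the text only states that the median across a gene's probes was used.

```python
from collections import defaultdict
from statistics import median


def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Return one expression profile per gene: the per-sample median over its probes.

    probe_values:  dict probe_id -> list of expression values (one per sample)
    probe_to_gene: dict probe_id -> gene name; probes missing here are unannotated
    """
    rows_by_gene = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # drop probes without a proper annotation
        rows_by_gene[gene].append(values)
    # zip(*rows) iterates over samples; take the median across probes per sample.
    return {
        gene: [median(column) for column in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }


if __name__ == "__main__":
    values = {"p1": [1.0, 2.0], "p2": [3.0, 6.0], "p3": [5.0, 5.0], "p4": [9.0, 9.0]}
    annotation = {"p1": "G1", "p2": "G1", "p3": "G2"}
    print(collapse_probes_to_genes(values, annotation))
```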

Table 2

Data summary with number of nonzero elements reported on the whole data set and
prediction error and its standard deviation x 100 comparisons reported on the training
set [Dave et al. (2004)]

                     LASSO           SCAD            SICA         MCP+
CV
  # of nonzeros      2145            653             0            154
  Prediction error   0.1516 (1.51)   0.1276 (1.60)   0.1898 (-)   0.1743 (1.45)
SGCV
  # of nonzeros      24              26              0            13
  Prediction error   0.0812 (1.03)   0.0643 (1.02)   0.1898 (-)   0.1043 (0.78)

The classical $L$-fold cross-validation is defined as

$$CV(\lambda) = \sum_{k=1}^{L}\{\ell(\hat\beta_\lambda^{(-k)}) - \ell^{(-k)}(\hat\beta_\lambda^{(-k)})\},$$

where $\ell$ stands for the partial likelihood and $\ell^{(-k)}$ for the partial likelihood
evaluated without the $k$th subset and similarly $\hat\beta_\lambda^{(-k)}$ for the penalized
estimator derived without using the $k$th subset. The measure of information
contained in the full Cox partial likelihood is biased with respect to the
number of nonzero elements and proper normalization is needed. The method
of generalized cross validation proposed by Fan and Li (2002) works very
well for small $p$ but fails for large $p$ because of its dependence on the inverse
of the Hessian matrix of the partial likelihood. This inspired us to define
a sparse approximation to the generalized cross-validation as

$$SGCV(\lambda) = \sum_{k=1}^{L}\Big\{\frac{\ell(\hat\beta_\lambda^{(-k)})}{n\{1 - s_\lambda/n\}^2}
  - \frac{\ell^{(-k)}(\hat\beta_\lambda^{(-k)})}{n^{(-k)}\{1 - s_\lambda/n^{(-k)}\}^2}\Big\},$$

where $s_\lambda = \|\hat\beta_\lambda^{(-k)}\|_0$ and $n^{(-k)}$ stands for the sample size of the whole set
without the $k$th subset. Then, we choose the regularization parameter as

$$\hat\lambda = \arg\min_{\lambda : s_\lambda < n} SGCV(\lambda).$$
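
The selection rule, a minimizer of SGCV over the grid points whose model size stays below the sample size, can be sketched as below. The input format (a list of triples holding the grid value, its precomputed SGCV value and its number of nonzero coefficients) and the function name are assumptions made for the example; computing the SGCV values themselves requires the fitted models and is not shown.

```python
def select_lambda(grid, n):
    """Pick the grid value minimizing SGCV among those with s_lambda < n.

    grid: iterable of (lam, sgcv_value, s_lam) triples
    n:    sample size
    """
    admissible = [(sgcv_value, lam) for lam, sgcv_value, s_lam in grid if s_lam < n]
    if not admissible:
        raise ValueError("no grid point satisfies s_lambda < n")
    # min() compares SGCV values first; ties are broken by the smaller lambda.
    return min(admissible)[1]


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    grid = [(0.01, -5.0, 200), (0.1, -2.0, 26), (0.5, -1.0, 3)]
    print(select_lambda(grid, n=150))
```

The constraint matters in practice: without it, the grid point with 200 nonzeros in the toy example would win despite exceeding the sample size.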

We applied 5-fold cross validation on the test set and evaluated its
performance on the training set. The Nelson-Aalen estimate of the cumulative
hazard rate function was used. The results are summarized in Table 2 and
show a big difference between the classical CV statistic and the generalized
one. The CV, not being scaled to the number of nonzero elements, always
prefers models with a bigger number of nonzeros. Note that $s > n$, for small $\lambda$,
is an artifact of the ICA algorithm.

The SICa penalty completely fails in this example. It detects nonzeros
at only 3 grid points, with the number of nonzeros being 2, 3 and 879. Both CV
methods fail to pick up the optimal one among the three points and choose
a fourth one, which leads to no signal detection. This is not unexpected,
since in all simulations SICa always picked the smallest number of TP+FP;
see Table 1.

Table 3 depicts the estimation results of the sparse generalized cross vali- 
dation method with LASSO, SCAD and MCP+ penalties. All three penalties 
Table 3

Data estimation summary of the genes selected by the sparse generalized cross validation
with standard deviation x 100 reported in parentheses

Gene annotation          LASSO                SCAD                 MCP+


FOSR l'RC!036794 s l 


— D 0093 O 34**1 
u.uuyo i Zi.u^ j 




X 


—0 0097 fl 54-h 


ClARRAfi ( AK0Q0735 1 ! 


o nn7n (a 56 1 )* 


0.0150 


(1.00*)*** 


ys 


V.i 1 1 1 V I 1 1 Jr\.\\ _104:00^ I 


V 


-0.0489 


(1.39)** 




hnht i i'RrTOnQ c >fi'i 

VjrlN Vjr J — ± ^UOUOUyOUj 




-0.0041 


(0.46) 




HTST1H1E 1^11603483"; 

_L_L_Lk_J J- -LJ-J- _L J_J I J_J KJ UUJTOU J 


—0 009(S ( 1 QR\ 


-0.0032 


(1.41) 




HTCIT1H9AP f'RP741 DO'}'! 


X 


-0.0137 


(0.41)** 


X 


1J7 IN iT.^ 1 IN lVI_UUUUUtJ I 




0.0095 


(1.29) 


y/ 


TMPm (nm nm 

11V1JT Vjrl 1 IN lvl_UUltJUO 1 


—0 01 fift (9 ^fi^ 


-0.0116 


(0.81) 




MATN 1 ! ("MM 009^811 

1V1.T\__L IN O I IN 1V1_UU^001 1 


090fi (0 ftQ^** 


0.0301 


(0.36)*** 


o nofi^ ( i 9"^ 


dtw (nm norm ^ > 

XX-L 11 1 IN lvl_UUU01tJ 1 


—0 00^9 (c\ ft^ 1 ) 


-0.0177 


(0.74)** 




ArvVji I IN lvl_UUUtJOU 1 


— ft ft79ftp-0^ M ^fi 1 ! 

O.OliirOti UU ^l.OUy 




X 




SPNQA fNM nfl9Q77 , l 


0040 (C\ R7~\ 




X 


ft ^7ft^P-(14 ('\ 


v^yvVy IjtJ l IN 1V1_UU^c/c/4: 1 




0.0026 


(1.69) 




O I lOJ )v.i 1 \ ^XJlvl ( ZiOOO I J 


o m 9^ (Tt 9^*** 




X 


X 


lllO J_ lllOlJ I IN 1V1_UUU 1 1 U 1 


ys 


-0.0029 


(0.81) 




MADpn (RPft79'37 c ;'l 


n nm ^ f9 ^4"; 




X 




HTPA^ (NM (104091 ^1 

v.. 1 j \j 1 v • ) I IN 1V1_L'U4:C/^ 1 1 


o ni 79 (o ft^* 


0.0170 


(0 71)* 


01 71 (C\ ^4^1** 


SEMA3A (XM_376647)       -0.0049 (1.15)       X                    -3.8781e-05 (0.76)
KIAA0861 (BX694003)      -0.0261 (0.96)*      -0.0181 (0.74)**     -0.0170 (0.58)**
FSCN2 (NM_012418)        0.0136 (1.25)        0.0194 (1.44)        0.0058 (1.12)
DKFZP566K0 (AL050040)    -0.0025 (1.36)       X                    X
MORC (BC050307)          0.0204 (0.75)*       0.0204 (1.00)*       0.0165 (0.94)
C14orf105 (AL01512)      0.0021 (1.47)        X                    X
SAGE1 (NM_018667)        X                    0.0012 (2.56)        X
C6orf103 (AL832192)      X                    0.0023 (1.20)        X
FLJ13841 (AK023903)      0.0146 (0.35)**      0.0146 (0.49)**      0.0129 (0.47)*
FLJ22655 (BC042888)      X                    0.0028 (0.64)        X
FLJ21934 (AY358727)      -0.0127 (0.55)*      -0.0125 (0.49)**     -0.0079 (0.32)*
KIAA1912 (AB067499)      X                    -0.0013 (0.65)       X
FLJ40298 (NM_173486)     0.0307 (0.98)***     0.0316 (1.20)**      X
MGC33951 (BC029537)      X                    -0.0042 (1.44)       X
NALP4 (AF479747)         -0.0059 (0.56)       -0.0059 (0.76)       -0.0062 (0.23)*
FLJ46154 (AK128035)      0.0185 (0.89)*       0.0185 (0.94)*       0.0141 (1.58)
MGC50372 (BX647272)      5.3676e-04 (1.68)    X                    X
LOC285016 (XM_211736)    0.0182 (2.56)        0.0182 (2.25)        0.0147 (2.15)

Superscripts ***, **, * are decodings of significance values.



yield sign consistency of estimated coefficients among the selected gene sets.
Note that the relative rankings of the corresponding estimated coefficients
differ among the methods. For example, gene FLJ40298 has the biggest
absolute size under the SCAD penalty, it is ranked number 5 among the
coefficients produced by the LASSO penalty and it is not even selected by the
MCP+ penalty. Interestingly, the common set of genes selected by LASSO
and SCAD has very consistent estimated coefficients. For most genes MCP+
results in smaller estimated values than SCAD and LASSO.

6. Discussion. We have studied penalized log partial likelihood methods 
for ultra-high dimensional variable selection for Cox's regression models. 
With nonconcave penalties, we have shown that such methods have model 
selection consistency with oracle properties even for NP-dimensionality. We 
have established that oracle properties hold with probability converging to 
one exponentially fast, and that the rate explicitly depends on the real and 
intrinsic dimensionality p and s, respectively. We have also developed an 
exponential inequality for deviations of a counting process from its compen- 
sator. Results for LASSO penalty were obtained as a special case. It con- 
firms explicitly that folded concave penalties allow for far weaker correlation 
structure than LASSO penalty. Furthermore, the asymptotic normality was 
proved, results of which can be used to construct confidence intervals of the 
estimated coefficients. 

SUPPLEMENTARY MATERIAL 

Supplementary material for "Regularization for Cox's proportional hazards
model with NP-dimensionality" (DOI: 10.1214/11-AOS911SUPP; .pdf).
In the Supplementary Material [Bradic, Fan and Jiang (2011)] we give
additional results of our simulation study, we specify the statements and
detailed proofs of technical Lemmas 2.1-2.3 and give complete proofs of
Theorems 2.1, 4.1, 4.4-4.6. We present the details of the ICA algorithm of
Section 5 together with new simulation settings where we increased the censoring
rate and/or increased the number of significant variables $s$, and with a
discussion on the relative estimation efficiency of the penalized methods. We
develop results on the growth of the $L_2$ norm of the score vector $U_n(\beta_1^*)$ and of
the matrix $\int_0^\tau V(\beta^*, t)\,dM(t)$. Moreover we establish a result on the
asymptotic behavior of the vector $\hat\beta_1$ when $s = o(n^{1/3})$ diverges with $n$. The main
tools used are the theory of martingales [Fleming and Harrington (1991)]
and the results of various matrix norms of Lemmas 4.1, 4.2 and 2.1-2.3.